OLIGOPROBE DESIGNSTATION: A COMPUTERIZED METHOD
FOR DESIGNING OPTIMAL OLIGONUCLEOTIDE PROBES AND PRIMERS 

A portion of the disclosure of this patent document contains material 
which is subject to copyright protection. The copyright owner has no 
objection to the facsimile reproduction by anyone of the patent disclosure, 
as it appears in the Patent and Trademark Office patent files or records, 
but otherwise reserves all copyright rights whatsoever. 

BACKGROUND OF THE INVENTION 
This invention relates to the fields of genetic engineering, microbiology, and 
computer science, and more specifically to an invention that helps the user, whether 
they be a molecular biologist or a clinical diagnostician, to calculate and design 
extremely accurate oligonucleotide sequences for use as probes, for example for DNA 
and mRNA hybridization procedures, or as primers, for example for DNA amplification 
and extension using the polymerase chain reaction (PCR). The following description
discusses the design of probes.

The oligonucleotide probes designed with this invention may be used to test for 
the presence of precursors of specific proteins in living tissues, or may be used for 
medical diagnostic kits, DNA identification, and potentially continuous monitoring of 
metabolic processes in human beings. The present implementation of this computerized
design tool runs under Microsoft® Windows™ v. 3.1 (made by Microsoft Corporation of
Redmond, Washington) on IBM® compatible personal computers (PCs).

Unless defined otherwise, all technical and scientific terms used herein have the 
same meaning as commonly understood by one of ordinary skill in the art to which this 
invention belongs. Although any methods and materials similar or equivalent to those 
described herein can be used in the practice or testing of the present invention, the 
preferred methods and materials are now described. All publications mentioned 
hereunder are incorporated herein by reference. 

To isolate a specific gene for any particular purpose, a researcher first has to 
have some idea of what he or she is looking for. To do this, the researcher needs to 
have a probe, which acts like a molecular hook that can identify and latch onto (i.e.,
bind to or hybridize with) the desired gene in a crowd of many other genes. A
researcher who can obtain an entire strand of mRNA can eventually find the gene from 
which it was copied, using complementary DNA (cDNA, which is a cloned equivalent 
to RNA and somewhat equivalent to mRNA) as a probe to search through the great
mass of genetic material and locate the desired original gene. cDNA essentially is 
manufactured or non-naturally occurring DNA from which all of the nonessential DNA 
has been removed. cDNA allows the researcher to concentrate entirely on the 
important portions of the gene being examined. The nonessential DNA regions are 
easy to recognize because when the gene is translated into protein, these regions do not 
wind up reflected in the protein sequence. These regions are called introns, or 
intervening regions. mRNA has no introns because they have been "spliced" out of the 
mRNA before translation. Thus, mRNA and cDNA contain only the essential 
information from a gene (called the exons). cDNA is the equivalent of mRNA with a
complementary sequence; only the exons are present. cDNA may be produced by
reverse transcription of mRNA. 

The procedure of using cDNA from known mRNA as a probe to search through 
genetic material and locate the original gene is called molecular hybridization, and is 
currently one method of identifying specific genes. However, this method is less than 
perfect, can be extremely time consuming, and often is not even feasible because the 
researcher actually has to have an entire strand of cDNA from the desired gene before 
he or she can attempt to use this cDNA to locate and identify the particular gene. 
Thus, it is something of a circular problem. If the researcher cannot obtain an entire 
strand of mRNA or cDNA from the desired gene, then he or she must somehow design 
a probe from scratch to be used to identify that gene. 

Oligonucleotide probes (that is, probes made up of a small number of
nucleotides, such as 17 to 100) are increasingly being used to identify specific genes
from genomic or cDNA libraries when a partial amino acid sequence is known (von
Heijne 1987, Ref. 15). This is a second method of determining a proper probe.
Although the present implementation of this invention does not deal with cases in which 
the proteins have been sequenced, but rather only the DNA or mRNA, it is possible 
that this invention or a future implementation of it might be used with protein 
sequences. Such probes can also be used as primers which, when annealed to mRNAs,
can be selectively extended into cDNAs (von Heijne 1987, Ref. 15).

Because of these situations, the problem that the researcher faces is to discover
or design a probe or mixture of probes that maximizes the researcher's chances of
successful hybridization while at the same time minimizing the amount of time and
money that has to be spent on discovering or designing the probes (von Heijne 1987,
Ref. 15). Researchers in the field have determined that computer analysis can greatly
expedite and simplify the search for optimal probe sequences (von Heijne 1987, Ref.
15). However, all of the search strategies known to the present inventors are time
consuming (both CPU and user time) and may be somewhat inaccurate. As stated in
von Heijne, "a true optimization of the probe in terms not only of degeneracy but in
terms of length, codon usage, Guanine-Cytosine (GC) avoidance, and expected
signal-to-noise ratio (hybridization to target over background) is a fairly complex
problem, however, and does not seem to have been automated so far" (von Heijne 1987,
Ref. 15). Various search strategies known and used in the field to identify and design
probes are outlined in the following sources: Lewis (1986, Ref. 9), Raupach (1984,
Ref. 11), Yang et al. (1984, Ref. 16), and Martin and Castro (1984, Ref. 10).

In the simplest version of a protein-related search strategy, the search procedure 
is limited to finding a set of probes of given lengths with the least possible degeneracy 
simply by scanning the amino acid sequence and noting the number of alternative 
codons in the corresponding oligonucleotide as the scan moves along the chain of
nucleotides (Lewis 1986, Ref. 9). The researcher can also include codon usage statistics
(because more than one codon can translate to the same amino acid), which would
attach a probability-of-occurrence value to each probe (Raupach 1984, Ref. 11).
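
The degeneracy scan described above can be illustrated with a minimal sketch. It assumes the standard genetic code (the codon counts per amino acid are well established), while the function names and the sample peptide are hypothetical and chosen only for illustration. It is not the procedure of any of the cited authors.

```python
# Number of codons per amino acid in the standard genetic code.
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def window_degeneracy(peptide: str) -> int:
    """Number of distinct oligonucleotides that could encode the peptide."""
    total = 1
    for residue in peptide:
        total *= CODON_COUNT[residue]
    return total


def least_degenerate_windows(protein: str, residues: int):
    """Scan the protein; return (degeneracy, start, peptide) sorted by degeneracy."""
    windows = []
    for start in range(len(protein) - residues + 1):
        peptide = protein[start:start + residues]
        windows.append((window_degeneracy(peptide), start, peptide))
    return sorted(windows)


if __name__ == "__main__":
    # Hypothetical peptide used only for illustration.
    for degeneracy, start, peptide in least_degenerate_windows("MLSRWMKFDE", 6)[:3]:
        print(start, peptide, degeneracy)
```

A window of six residues corresponds to an 18-nucleotide probe; windows rich in methionine and tryptophan, which have a single codon each, score best.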

A more advanced algorithm would allow the researcher to specify the way in 
which he or she plans to synthesize the probes (for example, by adding monomers or 
mixtures of monomers). It would also be easy for a researcher to add a rough estimate 
of the dissociation (or melting) temperatures of each probe to a program such as this.

One way to solve the problem of finding local similarities between two proteins 
being compared that has been discussed in the relevant literature is to use list-sorting 
or hashing routines (von Heijne 1987, Ref. 15). These routines are based on the
construction of a list or lookup table of k-letter words or k-tuples (i.e., all possible di- 
or trinucleotides), and the positions where they appear in the sequences being 
compared. This method is employed in some of the most extensively used "fast search" 
programs (see examples identified in von Heijne 1987, Ref. 15). 
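
As a rough illustration of such a lookup table, the sketch below records the positions of every k-letter word in one sequence and then reports the words shared with a second sequence. The function names are hypothetical, and the code is a generic k-tuple index, not the implementation of any particular "fast search" program.

```python
from collections import defaultdict


def build_ktuple_table(sequence: str, k: int) -> dict:
    """Map every k-letter word to the list of positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        table[sequence[pos:pos + k]].append(pos)
    return table


def shared_ktuples(seq_a: str, seq_b: str, k: int):
    """Return (word, pos_in_a, pos_in_b) for every k-tuple present in both."""
    table = build_ktuple_table(seq_a, k)
    hits = []
    for pos_b in range(len(seq_b) - k + 1):
        word = seq_b[pos_b:pos_b + k]
        # .get() avoids inserting empty entries into the defaultdict.
        for pos_a in table.get(word, []):
            hits.append((word, pos_a, pos_b))
    return hits
```

Building the table once makes each later lookup a constant-time dictionary access, which is why this approach suits searches of one query against many sequences.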

Two general methods of designing probes are common in the field, depending 
upon whether the researcher is trying to design a common probe or a specific probe. 
Common probes attempt to find common or consensus sequences among various species 
and among family genes. The first step in designing such a probe is to find the genes 
of interest. This may be done by performing a keyword or homology search against the 
GenBank (a genome database available from IntelliGenetics of Mountain View, CA) or
a keyword search against MEDLINE (the database currently available from the U.S. 
National Library of Medicine under the data access system known as Dialog of Dialog 
Information Service, Inc., Palo Alto, CA) or by performing a homology analysis between 
one of the genes of interest and whole GenBank sequences. The next step is to retrieve 
all of the relevant genes of interest. In the third step, multiple alignment analysis can 
be done using a commercially available software package such as DNASIS (from Hitachi 
Software of Brisbane, California), which is an autoconnect program. In this step, the 
computer identifies which nucleotides are common among the requested sequences:

    A1  AGGCCTCGGTTAGTTGGCCGTTGCCGAAAAA
    A2  AGGCGTCGGTTATTTGGGCCTTCCCAATGTG
    A3  AGGCGTCGGTTCTGTGGAACTTCCCGAGGAA
        **** ******   ***   ** ** *

    * = common among A1, A2, and A3
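
The marking of common nucleotides shown above can be reproduced with a short sketch. It assumes the sequences are already aligned without gaps, which holds for this example but not in general; the function name is hypothetical.

```python
def common_positions(sequences):
    """Return a marker string with '*' where all sequences share the same base."""
    length = min(len(s) for s in sequences)
    return "".join(
        "*" if len({s[i] for s in sequences}) == 1 else " "
        for i in range(length)
    )


# The three example sequences from the text.
A1 = "AGGCCTCGGTTAGTTGGCCGTTGCCGAAAAA"
A2 = "AGGCGTCGGTTATTTGGGCCTTCCCAATGTG"
A3 = "AGGCGTCGGTTCTGTGGAACTTCCCGAGGAA"

if __name__ == "__main__":
    for name, seq in (("A1", A1), ("A2", A2), ("A3", A3)):
        print(name, seq)
    print("  ", common_positions([A1, A2, A3]))
```

For these three sequences the sketch marks 18 common positions, the longest run being the six bases TCGGTT.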

Alternatively, after homology analyses between two sequences are carried out, data from 
the multiple homology analyses can be combined. The researcher then manually has to 
find the common or consensus region:

    A1  AGGCCTCGGTTAGTTGGCCGTTGCCGAAAAA
    A2  AGGCGTCGGTTATTTGGGCCTTCCCAATGTG

    A2  AGGCGTCGGTTATTTGGGCCTTCCCAATGTG
    A3  AGGCGTCGGTTCTGTGGAACTTCCCGAGGAA
        **** ******   ***   ** ** *

    * = common among A1, A2, and A3

Next, the researcher would input the sequence of the common region into the 
program and then analyze the secondary structure (i.e., the stacking site and the hairpin 
structure). After this, the researcher manually would select several candidate probes 
(from five to ten) which contain the minimal hairpin structure and specific length
according to the user's interest. A hairpin is an area in which a probe has "folded back" 
and one portion of the probe has hybridized with another portion of the same probe. 
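
A hairpin of the kind just described can be detected by looking for a stretch of the probe whose reverse complement occurs further along the same probe. The sketch below is a minimal illustration; the stem length of 4 and the minimum loop of 3 bases are hypothetical defaults, not values taken from any commercial package.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_hairpins(probe: str, stem: int = 4, min_loop: int = 3):
    """Return (left_start, right_start) pairs whose stems can pair with each other."""
    hairpins = []
    for left in range(len(probe) - 2 * stem - min_loop + 1):
        target = reverse_complement(probe[left:left + stem])
        # The right stem must start after the left stem plus the loop.
        first_right = left + stem + min_loop
        for right in range(first_right, len(probe) - stem + 1):
            if probe[right:right + stem] == target:
                hairpins.append((left, right))
    return hairpins
```

A candidate probe for which `find_hairpins` returns an empty list has no self-complementary stem of the chosen length, which is one simple way to rank candidates by hairpin content.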
The researcher would then perform a homology analysis between each candidate probe 
and all sequences in the GenBank to find all possible cross-hybridizable genes. Lastly,
the researcher manually would decide which is the best candidate probe by determining
which probe is highly homologous among the group of interest, but quite different from
other unrelated sequences in the GenBank.

The conventional methods for designing common oligonucleotide probes using 
currently available computer software have at least five problems: (1) they involve time 
consuming multiple processes; (2) it is difficult to control a significant variable, the 
melting temperature Tm of the oligonucleotide probes; (3) the methods do not recognize 
and differentiate exons and introns (thereby making it possible to have a designed probe
that is identical to unrelated mRNA sequences); (4) the methods may miss short pieces 
of identical sequences; and (5) it is difficult to recognize multiple pieces of identical 
sequences in the gene. 

The second method of designing probes that is common in the field involves 
designing specific probes. Specific probes attempt to find unique sequences among 
various species and among family genes and among published sequences in the 
GenBank. A specific probe is a probe that hybridizes with only one particular gene, 
thereby identifying the presence of that gene for the researcher. The procedure involves 
first finding the genes of interest (by performing a keyword search against the GenBank 
or against MEDLINE) and then retrieving all of the relevant genes of interest. A 
manual homology analysis between the gene of interest and whole sequences in the 
GenBank can be performed to find common and unique regions.

    A1  AGGCCTCGGTTAGTTGGCCGTTGCCGAAAAA
    B1  AGGCGTCGGTTATTGTGGTCTCCCCAATGTG
        common      unique

Next, the researcher would input the sequence of the unique region into the
program and then analyze the secondary structure. After this, the researcher would 
manually select several candidate probes which contain the minimal hairpin structure 
and specific length according to the user's interest. The researcher would then perform 
a homology analysis between each candidate probe and all sequences in the GenBank 
to find all possible cross-hybridizable genes. Lastly, the researcher manually would 
decide which is the best candidate probe by determining which probe does not have 
identical sequences in unrelated sequences in the GenBank.

All of the conventional methods for designing specific oligonucleotide probes
known to the inventors using currently available computer software have at least four 
problems: (1) they involve time consuming multiple processes; (2) it is difficult to 
control the melting temperature Tm of the oligonucleotide probes; (3) the methods do 
not allow for quantification of uniqueness; and (4) there is no guarantee that the method 
will design the best possible probe. 

None of the methods discussed in the literature discloses a system that may be 
used to design both common probes and extremely specific probes, especially a method 
that minimizes user and CPU time and is exceptionally accurate. 

Programs currently used for rapid database similarity searches use either hashing 
strategies or statistical strategies. The hashing strategy is now being used for the 
detection of relatively short regions of similarity, while the statistical strategy is now 
being used for the detection of weaker and longer similarity regions. The Mismatch 
Model of this invention can be used for very strong similarity searches with running 
times faster than current hashing strategies. 

The basic technologies behind the Mismatch Model used in this invention are 
hashing and continuous seed filtration, each general technology being known in the 
public domain and having been previously applied separately to non-genetic applications. 
To the best of the inventors' knowledge, these methods, used together, have never been 
suggested in other studies on optimal probe selection. The inventors' methods have a 
program performance of tens of seconds (CPU + I/O time) with a 1000 nucleotide 
query and all mammalian DNA on a SPARCstation, and are even faster on the more
common personal computer proposed herein. 

The H-Site Model of this invention likewise is unique in that it offers a multitude 
of information on selected probes and original and distinctive means of visualizing, 
analyzing and selecting among candidate probes designed with the invention. Candidate 
probes are analyzed using the H-Site Model for their binding specificity relative to some 
known set of mRNA or DNA sequences, collected in a database such as the GenBank 
database. The first step involves selection of candidate probes at some or all the 
positions along a given target. Next, a melting temperature model is selected, and an 
accounting is made of how many false hybridizations each candidate probe will produce 
and what the melting temperature of each will be. Lastly, the results are presented to 
the researcher along with a unique set of tools for visualizing, analyzing and selecting 
among the candidate probes.

This invention is both much faster and much more accurate than the methods
that are currently in use. It is unique because it is the only method that can find not 
only the most specific and unique sequence, but also the common sequences. Further, 
it allows the user to perform many types of analysis on the candidate probes, in addition 
to comparing those probes in various ways to the target sequences and to each other. 

Therefore, it is the object of this invention to provide a practical and user- 
friendly system that will allow a researcher to design both specific and common 
oligonucleotide probes, and to do this in less time and with much more accuracy than 
currently done. For example, the current version of the GenBank contains over ninety 
(90) million nucleotides. It is thought that the human genome alone consists of three 
billion base pairs, and scientists have so far managed to decode the base sequence of 
only about 500 human genes, less than one percent of the total. Currently available 
searching strategies are limited in how many of the GenBank's sequences can be
accessed and successfully searched, and how convenient and feasible such a search would 
be (in terms of both computer processor and human user time). It is also an object of 
this invention to allow the user to be able to run the program on more readily available 
and far less expensive computer hardware (i.e., a PC rather than a mainframe). This 
invention will remove those limits and allow genetic research to take a giant leap 
forward. 

These and other advantages and objects of this invention will become apparent 
from the following detailed descriptions, drawings, and appended claims.

BRIEF DESCRIPTION OF THE INVENTION

There is disclosed herein a system which allows the user to calculate and design 
extremely accurate oligonucleotide probes for DNA and mRNA hybridization 
procedures. The invention runs under Microsoft® Windows on IBM® compatible
personal computers (PCs). Its key features design oligonucleotide probes based on the
GenBank database of DNA and mRNA sequences and examine probes for specificity 
or commonality with respect to a user-selected experimental preparation of gene 
sequences. Hybridization strength between a probe and a subsequence of DNA or 
mRNA can be estimated through a hybridization strength model. Quantitatively, 
hybridization strength is given as the melting temperature Tm. Currently, two 
hybridization strength models are supported by this invention: 1) the Mismatch Model 
and 2) the H-Site Model. The user is allowed to select from the following calculations 
for each probe, results of which are available for display and analysis: 1) Sequence, 
Melting Temperature (Tm) and Hairpin characteristics; 2) Hybridization with other 
species within the preparation mixture; and 3) Location and Tm for the strongest
hybridizations. The results of the invention's calculations are then displayed on the 
Mitsuhashi Probe Selection Diagram (MPSD), which is a graphic display of all of the 
hybridizations of probes for the target mRNA with all sequences in the preparation. 

The Main Dialog Window of the present implementation of this invention 
controls all user-definable settings. The user is offered a number of options at this 
window. The File option allows the user to print, print in color, save selected probes, 
and exit the program. The Preparation option allows the user to open and create 
preparation (PRP) files. The Models option allows the user to choose between the two
hybridization models currently supported by the invention: 1) the H-Site Model and 2) 
the Mismatch Model. If the user selects the H-Site Model option, the user normally sets 
the following model parameters: 1) the melting temperature Tm for which probes are 
being designed (i.e., the melting temperature that corresponds to a particular experiment 
or condition the user desires to simulate); and 2) the nucleation threshold, which is the 
number of base pairs constituting a nucleation site. If the user selects the Mismatch 
Model option, the user normally sets the following model parameters: 1) probe length, 
which is the number of bases in probes to be considered; and 2) mismatch N, which is 
the maximum number of mismatches constituting a hybridization. 

The Mismatch Model program is used to design DNA and mRNA probes, 
utilizing sequence database information from sources such as GenBank and other 
databases with similar file formats. In the Mismatch Model, hybridization strength is
related only to the number of base pair mismatches between a probe and its binding 
site. Generally, the more mismatches a user allows, the more probes will be found. The 
Mismatch Model does not take into account the Guanine-Cytosine (GC) content of 
candidate probes, as does the H-Site Model, discussed below, so there is no reflection 
or indication of the probe's binding strength. The basic technologies employed by this 
model are hashing and continuous seed filtration. Hashing involves the application of 
an algorithm or process to the records in a set of data to obtain a symmetric grouping 
of the records. When using an indexed set of data, hashing is the process of 
transforming a record key to an index value for storing and retrieving a record
(Rosenberg 1984, Ref. 12). The concept of continuous seed filtration is discussed in
detail below. 
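
The transformation of a record key into an index value can be sketched for nucleotide words as follows. This is a generic two-bits-per-base encoding offered as an illustration; the function names are hypothetical and the sketch is not taken from the k_diff source.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_index(kmer: str) -> int:
    """Transform a k-letter key into a table index in the range [0, 4**k)."""
    index = 0
    for base in kmer:
        index = index * 4 + BASE_CODE[base]
    return index


def build_hash_table(sequence: str, k: int):
    """Store the start positions of every k-mer in a direct-address table."""
    table = [[] for _ in range(4 ** k)]
    for pos in range(len(sequence) - k + 1):
        table[kmer_to_index(sequence[pos:pos + k])].append(pos)
    return table
```

Because the table has 4**k slots, its memory footprint grows quickly with the word length, which is consistent with the remark elsewhere in this description that large seed sizes require a relatively large hash table.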

The essence of the Mismatch Model is a fast process for doing exact and inexact 
matching between DNA and mRNA sequences to support the Mitsuhashi Probe 
Selection Diagram (MPSD) and other types of analysis discussed above. The process 
used by the Mismatch Model is the Waterman-Pevzner Algorithm (the WPALG, which 
is named for two of the inventors), which is a computer-based probe selection process. 
Essentially, this is a combination of new and improved pattern matching processes. See 
Hume and Sunday (1991, Ref. 4), Landau et al. (1986-1990, Refs. 6, 7, 8), Grossi and
Luccio (1989, Ref. 3), and Ukkonen (1982, Ref. 14). 

There are three principal programs that make up the Mismatch Model in this 
implementation of the invention. The first is designated by the inventors as "k_diff." 
WPALG is used in k_diff to find all locations of matches of length greater than or equal
to l (l is user-specified) with less than or equal to k mismatches
(k is also user-specified) between the two sequences. If a candidate oligonucleotide
probe fails to match that well, it is considered unique. k_diff uses hashing and 
continuous seed filtration, and looks for homologs in GenBank and other databases with 
similar file formats. The technique of continuous seed filtration allows for much more 
efficient searching than previously implemented techniques. A seed is defined in this 
invention to be a subsequence of length equal to the longest exact match in the worst 
case scenario. For example, suppose the user selects a probe length (l) of 18, with 2 or
fewer mismatches (k). If a match exists with 2 mismatches, then the mismatches divide
the 18 bases into at most three exactly matching stretches, so there must be a
perfectly matching subsequence of length at least 6. Once the seed length has been
determined, the Mismatch Model looks at all substrings of that seed length (in this
example, that seed length would be 6), finds the perfectly matched base pair
subsequence of length equal to 6, and then looks to see if this subsequence extends to a
sequence of length equal to the user-selected probe length (i.e., 18 in this example). If
so, a candidate probe has been found that meets the user's criteria. 
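
The seed idea can be sketched as follows, under stated assumptions: the seed length is taken as the probe length divided by (k + 1), rounded down, the probe is cut into k + 1 non-overlapping seeds, and only substitutions (no insertions or deletions) are counted. The function names are hypothetical; this is an illustration of seed filtration in general, not the WPALG code.

```python
from collections import defaultdict


def seed_length(probe_length: int, max_mismatches: int) -> int:
    """Longest exact match guaranteed in the worst case (pigeonhole argument)."""
    return probe_length // (max_mismatches + 1)


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_hybridizations(probe: str, target: str, k: int):
    """Return sorted start positions in target matching probe with <= k mismatches."""
    l = len(probe)
    s = seed_length(l, k)
    # Hash every seed-length word of the target to its start positions.
    index = defaultdict(list)
    for pos in range(len(target) - s + 1):
        index[target[pos:pos + s]].append(pos)
    hits = set()
    # k + 1 non-overlapping seeds: at least one must be mismatch-free.
    for block in range(k + 1):
        offset = block * s
        for pos in index.get(probe[offset:offset + s], []):
            start = pos - offset
            if 0 <= start <= len(target) - l:
                # Extend the seed hit to the full probe length and verify.
                if mismatches(probe, target[start:start + l]) <= k:
                    hits.add(start)
    return sorted(hits)
```

Only positions where a seed matches exactly are ever extended, so most of the target is never compared base by base; this filtering is the source of the speed advantage.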

Where the seed size is large, the program allocates a relatively large amount of 
memory for the hash table. This invention has an option that allows memory allocation 
for GenBank entries just once at the beginning of the program, instead of reallocating 
memory for each GenBank entry. This reduces input time for GenBank entries by as 
much as a factor of two (2), but the user needs to know the maximum GenBank entry 
size in advance to do this. 

A probe is defined to hybridize if it has k or fewer mismatches in comparison 
with a target sequence from the database or file searched. Otherwise, it is non- 
hybridizing. The hit extension time for all appropriate parameters of the Mismatch 
Model has been found by experimentation to be less than thirty-five (35) seconds, except 
in one case where the minimum probe length (l) was set to 24 and the maximum
number of mismatches (k) was set to four (4), which is a situation that is never used in 
real gene localization experiments because the hybridization conditions are too weak. 

In this invention, the second hybridization strength model is termed the H-Site 
Model. One aspect of the H-Site Model uses a generalization of an experimental 
formula in general usage. The basic formula on which this aspect of the model is built 
is as follows: 

Tm = 81.5 - 16.6(log[Na]) - 0.63(%formamide) + 0.41(%(G + C)) - 600/N

In this formula, [Na] is the sodium concentration, %(G + C) is the percentage of
matched base pairs which are G-C complementary, and N is the probe length. In other 
words, this formula is an expression of the fact that melting temperature Tm is a 
function of both probe length and percent of Guanine-Cytosine (GC) content. This 
basic formula has been modified in this invention to account for the presence of 
mismatches. Each percent of mismatch reduces the melting temperature Tm by an 
average of 1.25 degrees (2 degrees C for an Adenine-Thymine mismatch, and 4 degrees 
C for a Guanine-Cytosine mismatch). This formula is, however, an approximation. The 
actual melting temperature might differ significantly from this approximation, especially 
for short probes or for probes with a relatively large number of mismatches.

Hybridization strength in the H-Site Model is related to each of the following
factors: 1) "binding region"; 2) type of mismatch (GC or AT substitution); 3) length of 
the probe; 4) GC content of the binding region (since GC pairs have a stronger bond 
than AT pairs, thus requiring a higher melting temperature); and 5) existence of a 
"nucleation site" (an exactly matching subsequence). The type of mismatch and the GC 
content of the binding region each contribute to a candidate probe's binding strength, 
which can be compared to other candidate probes' binding strengths to enable the user 
to select the optimal probe. 
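
The melting temperature formula given above, together with the average correction of 1.25 degrees per percent of mismatch, can be sketched as a single function. Several points are assumptions: the logarithm is taken as base 10, the sodium concentration is in moles per liter, and the sodium coefficient is exposed as a parameter whose default follows the sign printed in this description (commonly cited forms of this empirical formula use a positive coefficient). The function and parameter names are hypothetical.

```python
import math


def melting_temperature(
    na_molar: float,
    formamide_percent: float,
    gc_percent: float,
    probe_length: int,
    mismatch_percent: float = 0.0,
    sodium_coefficient: float = -16.6,  # sign as printed in the text (assumption)
    mismatch_penalty: float = 1.25,     # average degrees C per percent mismatch
) -> float:
    """Estimate Tm in degrees C from the empirical formula quoted in the text."""
    tm = (
        81.5
        + sodium_coefficient * math.log10(na_molar)
        - 0.63 * formamide_percent
        + 0.41 * gc_percent
        - 600.0 / probe_length
    )
    return tm - mismatch_penalty * mismatch_percent
```

For a 20-base probe with 50 percent GC content at 1 M sodium and no formamide, the sketch gives 72.0 degrees C; two mismatches (10 percent) lower the estimate to 59.5 degrees C.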

The fundamental assumption of the H-Site Model is that binding strength is 
determined by a paired subsequence of the probe-species combination, called the 
binding region. If the binding region contains more GC pairs than AT pairs, the binding 
strength will be higher since the G and C bases (connected with three bonds) form a 
tighter bond than the A and T bases (connected with two bonds). Thus, G and C bases, 
and probes that are GC rich, require a higher melting temperature Tm and subsequently 
form a stronger bond. As one of its unique features, the H-Site Model program
designs optimal probes, ideally ones that do not have any mismatches, but if there are 
mismatches the H-Site Model takes these into account. With this model, a candidate 
probe can afford to have more mismatches involving the AT bases if there are more GC 
bases than AT bases in the probe. This is because this model looks primarily at regions 
of the candidate probe and target sequence that match and does not "penalize" the 
probe for areas that do not match. If the mismatches are located at either or both of 
the ends of the binding region, this has little effect. It is much more deleterious to have 
mismatches in the middle of the binding region, as this will significantly lower the 
binding strength of the probe. 

The formula cited above for Tm applies within the binding region. The length 
of the probe is used to calculate percentages, but all other parameters of the formula 
are applied to the binding region only. The H-Site Model further assumes the existence 
of a nucleation site, which is a region of exact match. The length of this nucleation site 
may be set by the user. Typically, a value of 8 to 10 base pairs is used. To complete 
the H-Site Model, the binding region is chosen so as to maximize the melting 
temperature Tm among all regions containing a nucleation site, assuming one exists 
(otherwise, Tm = 0).
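
The selection of the binding region can be sketched as an exhaustive search over all regions that contain a nucleation site. The sketch rests on several assumptions that are not confirmed by the text: the probe and the candidate site are compared as same-strand strings of equal length (so that identical letters mean a complementary pair), percentages are taken over the full probe length while N is the length of the region, and the salt and formamide terms are omitted. All names are hypothetical.

```python
def region_tm(probe: str, site: str, start: int, end: int) -> float:
    """Tm estimate for the binding region [start, end); an assumed reading of the text."""
    pairs = list(zip(probe[start:end], site[start:end]))
    gc_matches = sum(1 for p, s in pairs if p == s and p in "GC")
    mismatch_count = sum(1 for p, s in pairs if p != s)
    gc_percent = 100.0 * gc_matches / len(probe)
    mismatch_percent = 100.0 * mismatch_count / len(probe)
    # Salt and formamide terms are omitted in this sketch.
    return 81.5 + 0.41 * gc_percent - 600.0 / (end - start) - 1.25 * mismatch_percent


def h_site(probe: str, site: str, nucleation: int = 8):
    """Return (tm, start, end) of the best binding region, or (0.0, None, None)."""
    n = len(probe)
    match = [p == s for p, s in zip(probe, site)]
    # Start positions of exact-match windows of the nucleation length.
    nuclei = [i for i in range(n - nucleation + 1) if all(match[i:i + nucleation])]
    best = (0.0, None, None)
    for start in range(n):
        for end in range(start + nucleation, n + 1):
            # Only regions that fully contain a nucleation site are eligible.
            if any(start <= i and i + nucleation <= end for i in nuclei):
                tm = region_tm(probe, site, start, end)
                if tm > best[0]:
                    best = (tm, start, end)
    return best
```

With a single mismatch at the last base of an 18-base probe, the sketch selects the region covering the first 17 bases, so the terminal mismatch is excluded rather than penalized.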

The H-Site Model is more complex than the Mismatch Model discussed above 
in that hybridization strength is modeled as a sum of signed contributions, with matches 
generally providing positive binding energy and mismatches generally providing negative
binding energy. The exact coefficients to be used depend only on the matched or 
mismatched pair. These coefficients may be specified by the user, although in the 
current version of this invention these coefficients are not explicitly user-selectable, but 
rather are selected to best fit the hybridization strength formulas developed by Itakura 
et al. (1984, Ref. 5), Bolton and McCarthy (1962, Ref. 2), Benner et al. (1973, Ref. 1),
and Southern (1975, Ref. 13). 
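
When hybridization strength is a sum of signed per-pair contributions, the strongest contiguous region can be found in a single pass with the standard maximum-sum-subarray technique (Kadane's algorithm). The coefficients below are hypothetical placeholders, since the fitted values are not listed in this description, and the probe and site are again assumed to be compared as same-strand strings of equal length.

```python
# Hypothetical coefficients; the text does not list the fitted values.
MATCH_SCORE = {"A": 2.0, "T": 2.0, "G": 3.0, "C": 3.0}
MISMATCH_SCORE = -4.0


def best_region(probe: str, site: str):
    """Maximum-sum contiguous region (Kadane's algorithm) over per-pair scores."""
    best_sum, best_span = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, (p, s) in enumerate(zip(probe, site)):
        run_sum += MATCH_SCORE[p] if p == s else MISMATCH_SCORE
        if run_sum <= 0.0:
            # A non-positive prefix can never help; restart after this position.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_span = run_sum, (run_start, i + 1)
    return best_sum, best_span
```

A mismatch at either end of the site is simply left out of the best region, whereas a mismatch in the middle is included only if the matches on both sides outweigh its negative contribution.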

A unique aspect of the H-Site Model is that hybridization strength is defined to
be determined by the optimal binding region between the candidate probe and the
binding locus. This binding region is called the hybridization site, or h-site, and is
selected so as to maximize overall hybridization strength, so that mismatches outside the 
binding region do not detract from the estimated hybridization strength. Several other 
unique features of the H-Site Model include the fact that it is more oriented toward 
RNA and especially cDNA sequences than DNA sequences, and the fact that the user 
has control over preparation and environmental variables. The first feature allows the 
user to concentrate on "meaningful" sequences, rather than having to sort through all of 
a DNA sequence (including the introns). The second feature allows the user to more 
accurately simulate laboratory conditions and more closely correspond with any 
experiments he or she is conducting. Further, this implementation of the invention does 
some preliminary preprocessing of the GenBank database to sort out and select the 
cDNA sequences. This is done by locating a keyword (in this case CDS) in each 
GenBank record, thereby eliminating any sequences containing introns. 
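
The keyword-based preprocessing can be sketched as a filter over GenBank flat-file text. The sketch assumes the usual flat-file layout, in which each record ends with a line containing only "//" and a coding sequence appears as a feature line whose first token is CDS. The function names are hypothetical and the sketch is not the preprocessing code of this implementation.

```python
def split_records(flatfile_text: str):
    """Split GenBank flat-file text into records; each record ends with a '//' line."""
    record = []
    for line in flatfile_text.splitlines():
        if line.strip() == "//":
            if record:
                yield "\n".join(record)
            record = []
        else:
            record.append(line)
    if record:  # tolerate a missing final terminator
        yield "\n".join(record)


def has_cds(record: str) -> bool:
    """True if some line of the record has CDS as its first token."""
    return any(line.split()[:1] == ["CDS"] for line in record.splitlines())


def select_cds_records(flatfile_text: str):
    """Keep only the records annotated with a CDS feature."""
    return [r for r in split_records(flatfile_text) if has_cds(r)]
```

Filtering once, before any probe search, means that later searches read a smaller preparation file and never consider records without an annotated coding sequence.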

The Mitsuhashi Probe Selection Diagram (MPSD), FIG. 4, is the third key 
feature of this invention, as it is a unique way of visualizing the results of the probe 
designing performed by the Mismatch and H-Site Models. It is a graphic display of all 
of the hybridizations of candidate oligonucleotide probes for the target mRNA with all 
sequences in the preparation. Given a gene sequence database and a target mRNA 
sequence, the MPSD graphically displays all of the candidate probes and their 
hybridization strengths with all sequences from the database. In the present 
implementation, each melting temperature Tm is displayed as a different color, from red 
(highest Tm) to blue (lowest Tm). The MPSD allows the user to see visually the 
number of false hybridizations at various temperatures for all candidate probes, and the 
sources of these false hybridizations (with a comparison of loci and sequences). A locus
may be a specific site or place, or, in the genetic sense, a locus is any of the homologous
parts of a pair of chromosomes that may be occupied by allelic genes.

BRIEF DESCRIPTION OF THE DRAWING

This invention may be more clearly understood from the following detailed 
description and by reference to the drawing in which: 

FIG. 1 is a simplified block diagram of a computer system illustrating the overall 
design of this invention; 

FIG. 2 is a display screen representation of the main dialog window of this 
invention; 

FIG. 3 is a flow chart of the overall invention illustrating the program, and the 
invention's sequence and structure; 

FIG. 4 is a display screen representation of the Mitsuhashi probe selection 
diagram; 

FIG. 5 is a display screen representation of the probeinfo and matchinfo window;

FIG. 6 is a display screen representation of the probesedit window;

FIG. 6a is a printout of the probesedit output file;

FIG. 7 is a flow chart of the overall k_diff program of the Mismatch Model of
this invention, including its sequence and structure; 

FIG. 8 is a flow chart of the k_diff module of this invention; 

FIG. 9 is a flow chart of the hashing module of this invention; 

FIG. 10 is a flow chart of the tran module of this invention; 

FIG. 11 is a flow chart of the let_dig module of this invention; 

FIG. 12 is a flow chart of the update module of this invention; 

FIG. 13 is a flow chart of the assembly module of this invention; 

FIG. 14 is a flow chart of the seqload module of this invention; 

FIG. 15 is a flow chart of the read1 module of this invention;

FIG. 16 is a flow chart of the dig_let module of this invention;

FIG. 17 is a flow chart of the q_colour module of this invention; 

FIG. 18 is a flow chart of the hit_ext module of this invention; 

FIG. 19 is a flow chart of the colour module of this invention; 

FIG. 20 is a printout of a sample file containing the output of the Mismatch 
Model program of this invention; 

FIG. 21 is a flow chart of the H-Site Model, stage I, covering the creation of a 
preprocessed preparation file of this invention; 

FIG. 22 is a flow chart of the H-Site Model, stage II, covering the preparation
of the target sequence(s);

FIG. 23 is a flow chart of the H-Site Model, stage III, covering the calculation
of MPSD data; 

FIG. 24a is a printout of a sample file containing output of the Mismatch Model 
program; 

FIG. 24b is a printout of a sample file containing output of the H-Site Model 
program; 

FIG. 25 is a flow chart of the processing used to create the Mitsuhashi probe 
selection diagram (MPSD); 

FIG. 26 is a flow chart of processing used to create the matchinfo window; 
FIG. 27 is a printout of a sample target species file; 
FIG. 28 is a printout of a sample preparation file.

DETAILED DESCRIPTION OF THE INVENTION

This invention is employed in the form best seen in FIG. 1. There, the 
combination of this invention consists of an IBM ® compatible personal computer (PC), 
running software specific to this invention, and having access to a distributed database 
with the file formats found in the GenBank database and other related databases. 

The preferred computer hardware capable of operating this invention consists of
a system with at least the following specifications (FIG. 1): 1) an IBM ® compatible PC,
generally designated 1A, 1B, and 1C, with an 80486 coprocessor, running at 33 MHz or
faster; 2) 8 or more MB of RAM, 1A; 3) a hard disk 1B with at least 200 MB of storage
space, but preferably 1 GB; 4) a VGA color monitor 1C with graphics capabilities of a
size sufficient to display the invention's output in readable format, preferably with a
resolution of 1024 x 768; and 5) a 580 MB CD ROM drive 5 (1B of FIG. 1 generally
refers to the internal storage systems included in this PC, clockwise from upper right,
two floppy drives, and a hard disk). Because the software of this invention preferably
has a Microsoft ® Windows ™ interface, the user will also need a mouse 2, or some other
type of pointing device.

The preferred embodiment of this invention would also include a laser printer 3 
and/or a color plotter 4. The invention may also require a modem (which can be 
internal or external) if the user does not have access to the CD ROM versions of the 
GenBank database 8 (containing a variable number of gene sequences 6). If a modem 
is used, information and instructions are transmitted via telephone lines to and from the 
GenBank database 8. If a CD ROM drive 5 is used, the GenBank database (or specific 
portions of it) is stored on a number of CDs. 

The computer system should have at least the Microsoft ® DOS v. 5.0 operating 
system running Microsoft ® Windows ™ v. 3.1. All of the programs in the preferred 
embodiment of the invention are written in the Borland ® C++ (made by Borland
International, Inc., of Scotts Valley, CA) computer language. It must be recognized that
subsequently developed computers, storage systems, and languages may be adapted to 
utilize this invention and vice versa. 

This invention is designed to enable the user to access DNA, mRNA and cDNA 
sequences stored either in the GenBank or in databases with similar file formats. 
GenBank is a distributed flat file database made up of records, each record containing 
a variable number of fields in ASCII file format. The stored database itself is 
distributed, and there is no one database management system (DBMS) common to even 
a majority of its users. One general format, called the line type format, is used both for
the distributed database and for all of GenBank's internal record keeping. All data and
system files and indexes for GenBank are kept in text files in this line type format. 

The primary GenBank database is currently distributed in a multitude of files or 
divisions, each of which represents the genome of a particular species (or at least as 
much of it as is currently known and sequenced and publicly available). The GenBank 
provides a collection of nucleotide sequences as well as relevant bibliographic and 
biological annotation. Release 72.0 (6/92) of the GenBank CD distribution contains 
over 71,000 loci with a total of over ninety-two (92) million nucleotides. GenBank is 
distributed by IntelliGenetics, of Mountain View, CA, in cooperation with the National 
Center for Biotechnology Information, National Library of Medicine, in Bethesda, MD.

1. Overall Description of the Invention 

a. General Theory 

The intent of this invention is to provide one or more fast processes for 
performing exact and inexact matching between DNA sequences to support the 
Mitsuhashi Probe Selection Diagram (MPSD), discussed below, and other analysis with 
interactive graphical analysis tools. Hybridization strength between a candidate 
oligonucleotide probe and a subsequence of DNA, mRNA or cDNA can be estimated 
through a hybridization strength model. Quantitatively, hybridization strength is given 
as the melting temperature Tm. Currently, two hybridization strength models are 
supported by the invention: 1) the Mismatch Model and 2) the H-Site Model. 

b. Inputs 

i. Main Dialog Window 

The Main Dialog Window, FIG. 2, controls all user-definable settings. This 
window has a menu bar offering five options: 1) File 10; 2) Preparation 20; 3) Models 
30; 4) Experiment 40; and 5) Help 50. The File 10 option allows the user to print, print 
in color, save selected probes, and exit the program. The Preparation 30 option allows 
the user to open and create preparation (PRP) files. 

The Models 20 option allows the user to choose between the two hybridization
models currently supported by the invention: 1) the H-Site Model 21 and 2) the 
Mismatch Model 25. If the user selects the H-Site Model 21 option, the left hand menu 
of FIG. 2C is displayed and the user sets the following model parameters: 1) the 
melting temperature Tm 22 for which probes are being designed (i.e., the melting 
temperature that corresponds to a particular experiment or condition the user desires
to simulate); and 2) the nucleation threshold 23, which is the number of base pairs 
constituting a nucleation site. If the user selects the Mismatch Model 25 option, the 
right hand menu of FIG. 2C is displayed and the user sets the following model 
parameters: 1) probe length 26, which is the number of base pairs in probes to be 
considered; and 2) mismatch N 27, which is the maximum number of mismatches 
constituting a hybridization. Computation of the user's request will take longer with the 
H-Site Model if the threshold 23 setting is decreased and with the Mismatch Model if 
the number of mismatches K 27 is increased. 

In addition, for both Model options the user chooses the target species 11 DNA
or mRNA for which probes are being designed and the preparation 12, a file of all 
sequences with which hybridizations are to be calculated. A sample of a target species 
file is shown in FIG. 27 (humbjunx.cds), while a sample of a preparation file is shown 
in FIG. 28 (junmix.seq). Each of these inputs is represented by a file name and 
extension in general DOS format. In the target species and preparation fields, the file 
format follows the GenBank format, and each of the fields includes a default file 
extension. Pressing the "OK" button 41 of FIG. 2C will cause the processing to begin, 
and pressing the "Cancel" button 43 will cause it to stop. 

The Experiment 40 option and the Help 50 option are expansion options not yet
available in the current implementation of the invention.

c. Processing 

FIG. 3 is a flow chart of the overall program, illustrating its sequence and
structure. The main or "control" program of the invention performs
overall maintenance and control functions. This program, as illustrated in FIG. 3,
accomplishes the general housekeeping functions 51, such as defining global variables.
The user-friendly interface 53 carries out the user-input procedures 55, the file 57 or
database 59 access procedures, calling of the model program 62 or 63 selected by the
user, and the user-selected report 65 or display 67, 69, 71 and 73 features. Each of
these features is discussed in more detail in later sections, with the exception of the
input procedures, which involve capturing the user's set-up and control inputs.

d. Outputs 

i. The Mitsuhashi Probe Selection Diagram Window 
The Mitsuhashi Probe Selection Diagram (MPSD), FIG. 4, is a key feature of the 
invention as it is a unique way of visualizing the results of the program's calculations. 
It is a graphic display of all of the hybridizations of probes for the target mRNA with
all sequences in the preparation. In other words, given a sequence database and a 
target mRNA, the MPSD graphically displays all of the candidate probes and their 
hybridization strengths with all sequences from the sequence database. The MPSD 
allows the user to see visually the number of false hybridizations at various temperatures 
for all candidate probes, and the sources of these false hybridizations (with a
comparison of loci and sequences).

For each melting temperature Tm of interest, a graphical representation of the 
number of hybridizations for each probe is displayed. In the preferred embodiment, this 
representation is color coded. In this implementation of the invention, the color red 123 
identifies the highest melting temperature Tm and the color blue 124 identifies the 
lowest melting temperature Tm. Each mismatch results in a reduction in Tm. Tm is 
also a function of probe length and percent content of GC bases. Within the window, 
the cursor 125 shape is changed from a vertical line bisecting the screen to a small 
rectangle when the user selects a particular probe. The current probe is defined to be 
that probe under the cursor position (whether it be a line or a rectangle) in the MPSD 
window. More detailed information about the current probe is given in the ProbeInfo
and MatchInfo windows, discussed below. Clicking the mouse 2 once at the cursor 125
selects the current probe. Clicking the mouse 2 a second time deselects the current 
probe. Moving the cursor across the screen causes the display to change to reflect the 
candidate probe under the current cursor position. 

The x-axis 110 of the MPSD, FIG. 4, shows the candidate probes' starting 
positions along the given mRNA sequence. The user may "slide" the display to the left 
or right in order to display other probe starting positions. The y-axis 115 of the MPSD 
displays the probe specificity, which is calculated by the program. 

The menu options 116, 117, 118, 119, and 120 available to the user while in the 
MPSD, FIG. 4, are displayed along a menu bar at the top of the screen. The user can 
click the mouse 2 on the preferred option to briefly display the option choices, or can 
click and hold the mouse button on the option to allow an option to be selected. The 
user may also type a combination of keystrokes in order to display an option in 
accordance with well-known computer desk top interface operations. This combination 
usually involves holding down the ALT key while pressing the key representing the first 
letter of the desired option (i.e., F, P, M, E or H).

The File option 116 allows the user to specify input files and databases. The
Preparation option 117 allows the user to create a preparation file summarizing the
sequence database. The Models option 118 allows the user to specify the hybridization
model (i.e., H-Site or Mismatch) and its parameters. The Experiment option 119 and
the Help option 120 are not available in the current implementation of this invention.
These options are part of the original Main Dialog Window, FIG. 2. 

Areas on the graphical display of the MPSD, FIG. 4, where the hybridizations for 
the optimal probes are displayed are lowest and most similar, such as shown at 121, 
indicate that the particular sequence displayed is common to all sequences. Areas on 
the graphical display of the MPSD where the hybridizations for the optimal probes are 
displayed are highest and most dissimilar, such as shown at 122, indicate that the 
particular sequence displayed is extremely specific to that particular gene fragment. The 
high points on the MPSD show many loci in the database, to which the candidate probe 
will hybridize (i.e., many false hybridizations). The low points show few hybridizations, 
at least relative to the given database. In other words, the sequence shown at 121 would 
reflect a probe common to all of the gene fragments tested, such that this probe could 
be used to detect each of these genes. The sequence shown at 122 would reflect a 
probe specific to the particular gene fragment, such that this probe could be used to 
detect this particular gene and no others. 

ii. The ProbeInfo and MatchInfo Window

The combined ProbeInfo and MatchInfo Window, FIG. 5, displays detailed
information about the current candidate probe. The upper portion of the window is the
ProbeInfo window, and the lower portion is the MatchInfo window. The ProbeInfo
window portion displays the following types of information: the target locus (i.e., the
mRNA, cDNA, or DNA from which the user is looking for probes) is displayed at 131,
while the preparation used for hybridizations is displayed at 132. In the example shown
in FIG. 5, the target locus 131 is the file named HUMBJUNX.CDS, which is shown as
being located on drive F in the subdirectory MILAN. The preparation 132 is shown as 
being the file designated JUNMIX.PRP, which is also shown as being located on drive 
F in the subdirectory MILAN. The JUNMIX.PRP preparation in this example is a 
mixture of human and mouse jun loci. 

The current and optimal probe's starting position is shown at 135. The current 
candidate oligonucleotide probe is defined at 136, and is listed at 137 as having a length 
of 21 bases. The melting temperature for the probe 136 as hybridized with the targets 
is shown in column 140. The melting temperature for the optimal probe is given as 61.7 
degrees C at 138. The ProbeInfo Window, FIG. 5, also displays hairpin characteristics
of the probe at 139. In the example shown, the ProbeInfo Window shows that there are
four (4) base pairs involved in the worst hairpin, and that the worst hairpin has a length
of one (1) (see FIG. 5, at 139).

The MatchInfo Window portion displays a list of hybridizations between the
current probe and species within the preparation file, including hybridization loci and 
hybridization temperatures. The hybridizations are listed in descending order by melting 
temperature. The display shows the locus with which the hybridization occurs, the 
position within the locus, and the hybridization sequence. 
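
By way of illustration only, the ordering of the MatchInfo listing can be sketched in Python, which is not the language of the preferred embodiment. The record fields mirror the columns described above. The class name `Match`, the function `match_info_rows`, and the sample loci, positions, temperatures and sequences are hypothetical and are not taken from FIG. 5.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One hybridization row: where it occurs and how strongly (hypothetical)."""
    locus: str        # locus with which the hybridization occurs
    position: int     # starting position within that locus
    tm: float         # melting temperature, degrees C
    sequence: str     # hybridization sequence


def match_info_rows(matches):
    """Return the rows in descending order of melting temperature."""
    return sorted(matches, key=lambda m: m.tm, reverse=True)


# Made-up sample data, for illustration only.
rows = match_info_rows([
    Match("LOCUS_B", 412, 48.3, "ACGTTGCAACGTTGCAACGTA"),
    Match("LOCUS_A", 97, 61.7, "ACGTTGCAACGTAGCAACGTA"),
    Match("LOCUS_C", 15, 39.0, "ACGATGCAACGTTGGAACGTA"),
])
for row in rows:
    print(row.locus, row.position, row.tm, row.sequence)
```

The strongest hybridization appears first, so a perfect match of the probe with its own target would head the list.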

In the MatchInfo window portion, the candidate probe 136 is shown at 150 as
hybridizing completely with a high binding strength. This is because the target DNA is
itself represented in the database in this case, so the candidate probe is seen at 150 to
hybridize with itself (a perfect hybridization). The locus of each hybridization from the
preparation 132 is displayed in column 141, while the starting position of each
hybridization is given in column 142. The calculated hybridizations are shown at 145.

iii. The ProbesEdit Window 

The ProbesEdit Window, FIG. 6, is a text editing window provided for convenient 
editing and annotation of the invention's text file output. It is also used to accumulate 
probes selected from the MPSD, FIG. 4, by mouse 2 clicks. Standard text editing 
capabilities are available within the ProbesEdit Window. The user may accumulate 
selected probes in this window (see 155 for an example) and then save them to a file 
(which will bear the name of the preparation sequence with the file extension of "prb" 
156, or may be another file name selected by the user). A sample of this file is shown 
in FIG. 6A. 

iv. Miscellaneous Output 

The present embodiment of this invention also creates two output files, currently
named "test.out" and "test1.out", depending upon which model the user has selected.
The first file, "test.out", is created with both the Mismatch Model and the H-Site Model.
This file is a textual representation of the Mitsuhashi Probe Selection Diagram (MPSD).
It breaks the probe sequence down by position, length, delta Tm, screensN, and the
actual probe sequence (i.e., nucleotides). An example of this file created by the
Mismatch Model is shown in FIG. 20, and an example created by the H-Site Model is
shown in FIG. 24A. The second file, "test1.out", is created only by the H-Site Model.
This file is a textual representation of the ProbeInfo and MatchInfo window that
captures all hybridizations, along with their locus, starting position, melting temperature,
and possible other hybridizations. A partial example of this file is shown in FIG. 24B
(10 pages out of a total of 190 pages created by the H-Site Model).

2. Description of the Mismatch Model Program 

a. Overview 

In this invention, one of the hybridization strength models is termed the 
Mismatch Model (see FIG. 2 for selection of this model). The basic operation of this 
model involves the techniques of hashing and continuous seed filtration, as defined 
earlier and described in more detail below. The essence of the Mismatch Model is a fast 
process for doing exact and inexact matching between DNA and mRNA sequences to 
support the Mitsuhashi Probe Selection Diagram (MPSD). There are a number of 
modules in the present implementation of the Mismatch Model contained in this 
invention, the most significant of which are shown in the flow chart in FIG. 7 and in 
more detail in FIGS. 8 through 18. The main k_diff module shown in the flow chart in 
FIG. 8 is a structured program that provides overall control of the Mismatch Model, 
calling various submodules that perform different functions. 

b. Inputs 

The user-selected input variables for this model are minimum probe length 26 
(which is generally from 18 to 30) and maximum number of mismatches 27 (which
generally is from 1 to 5). These inputs are entered by the user in the Main Dialog 
Window, FIG. 2C. 

c. Processing 

i. k_diff Program 

Some terms of art need to be defined before the processing performed by this 
module can be explained. A hash table basically is an array or table of data. A linked 
list is a classical data structure which is a chain of linked entries and involves pointers 
to other entry structures. Entries in a linked list do not have to be stored sequentially 
in memory, as is the case with elements contained in an array. Usually there is a 
pointer to the list associated with the list, which is often initially set to point to the start 
of the list. A pointer to a list is useful for sequencing through the entries in the list. 
A null pointer (i.e., a pointer with a value of zero) is used to mark the end of the list. 

As the flow charts in FIGS. 7 and 8 illustrate, the general process steps and
implemented functions of this model can be outlined as follows:

Step 1: First, create a hash table and linked list from the query (FIG. 7, hashing
module 222).

Step 2: Next, while there are still GenBank entries available for searching (FIG.
7, assembly module 230):

   Step 2a: Read the current GenBank entry (record) sequence of user-specified
   length (FIG. 7, seqload module 232), or read the current
   sequence (record) from the file selected by the user (FIG. 7, read1 module
   234).

   Step 2b: For the current sequence, for each position of the sequence from
   the first position (or nucleotide) to the last position (or nucleotide)
   (incrementing the position number once each iteration of the loop) (FIG.
   7, q_colour module 242),

      Step 2c: set the variable dna_hash equal to the hash of the current
      position of the current sequence (FIG. 7, q_colour module 242).

      Step 2d: While not at the end of the linked list for dna_hash (FIG.
      7, q_colour module 242),

         Step 2e: set the query_pos equal to the current position of
         dna_hash in the linked list (FIG. 7, q_colour module 242)
         and

         Step 2f: Extend the hit with the coordinates (query_pos,
         dna_pos) (FIG. 7, hit_ext module 244),

         Step 2g: If there exists a k_mismatch in the current
         extended hit (FIG. 7, colour module 246), then

            Step 2h: print the current hit (FIG. 7, q_colour
            module 242), and repeat from Step 2.

As this illustrates, there are three (3) basic looping or iteration processes with
functions being performed based on variables such as whether the GenBank section end
has been reached (the first "WHILE" loop, Step 2), whether the end of the current DNA
entry has been reached (the "FOR" loop, Step 2b), and whether the end of the dna_hash
linked list has been reached (the second "WHILE" loop, Step 2d). A "hit" will only be
printed if there are k_mismatches in the current extended hit.
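
The three nested loops outlined above can be sketched as follows. This is a minimal illustration in Python, not the Borland C++ code of the preferred embodiment, and it rests on several assumptions: a Python dictionary stands in for the hash table and linked list, a simple window-by-window mismatch count stands in for the hit_ext and colour modules, and the function name `find_hits` and the sample sequences are hypothetical.

```python
from collections import defaultdict


def find_hits(query, entries, probe_len, max_mismatch):
    """Report probe windows of `query` that match an entry with few mismatches.

    Returns sorted tuples (locus, query_start, dna_start, mismatches),
    with 0-based start positions.
    """
    # Seed length: ceiling of (probe_len - max_mismatch) / (max_mismatch + 1).
    seed = -(-(probe_len - max_mismatch) // (max_mismatch + 1))

    # Step 1: index every seed-length substring of the query.
    index = defaultdict(list)
    for q in range(len(query) - seed + 1):
        index[query[q:q + seed]].append(q)

    hits = set()
    # Step 2: loop over the database entries.
    for locus, dna in entries:
        # Step 2b: loop over the positions of the current entry.
        for d in range(len(dna) - seed + 1):
            # Steps 2c to 2e: visit each query position sharing this seed.
            for q in index.get(dna[d:d + seed], ()):
                # Step 2f: try every probe window that contains the seed
                # and fits inside both the query and the entry.
                lo = max(0, q + seed - probe_len, q - d)
                hi = min(q, len(query) - probe_len,
                         q - d + len(dna) - probe_len)
                for s in range(lo, hi + 1):
                    t = s + (d - q)  # start of the aligned window in dna
                    mism = sum(a != b for a, b in
                               zip(query[s:s + probe_len],
                                   dna[t:t + probe_len]))
                    # Steps 2g and 2h: keep windows within the budget.
                    if mism <= max_mismatch:
                        hits.add((locus, s, t, mism))
    return sorted(hits)


# Made-up sequences: the entry contains the query with one substitution.
example_hits = find_hits("ACGTACGTTGCA",
                         [("X", "GGACGTACCTTGCAGG")],
                         probe_len=8, max_mismatch=1)
for hit in example_hits:
    print(hit)
```

Checking each window directly is slower than a dedicated extension routine, but it keeps the sketch short and makes the role of the seed lookup easy to see.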

FIGS. 8 through 18 illustrate the functions of each of the modules of the present 
embodiment of this invention, all of which were generalized and summarized in the 
description above. FIG. 8, which outlines the main "k_diff" module, shows that this
module is primarily a program organization and direction module, in addition to
performing routine "housekeeping" functions, such as defining the variables and hash 
tables 251, checking if the user-selected gene sequence file is open 252, extracting 
needed identification information from the GenBank 253, and ensuring valid user input 
254. This module also performs a one-time allocation of memory for the gene 
sequences, and allocates memory for hit information, hashing, hybridization and 
frequency length profiles and output displays, 255 & 256. The "k_diff" module also 
initializes or "zeros out" the hashing table, the linked hashing list and the various other 
variables 257 in preparation for the hashing function. In addition, this module forms the 
hash tables 258 and extracts a sequence and finds the sequence length 259. 

One of the most important functions performed by the "k_diff" module is to
define the seed (or kernel or k_tuple) size. This is done by setting the variable k_tuple
equal to (min_probe_length - max_mismatch_#)/(max_mismatch_# + 1), FIG. 8 at 265.
Next, if the remainder of this division is not equal to zero 266, then the
value of the variable k_tuple is incremented by one 267. The resulting value is the size 
of the seed. The module then reads the query 268 and copies the LOCUS name 269 
for identification purposes (a definition of the term locus is given earlier in the 
specification). 
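
The seed size calculation amounts to a ceiling division. One way to see why this size is useful: a probe of length L with at most k mismatches has at least L - k matching positions, split by the mismatches into at most k + 1 runs, so at least one exactly matching run is no shorter than the seed. A minimal Python sketch follows; the function name `seed_size` and its parameter names are illustrative and do not come from the program itself.

```python
def seed_size(min_probe_length: int, max_mismatches: int) -> int:
    """Seed (k_tuple) length: (L - k) / (k + 1), rounded up."""
    quotient, remainder = divmod(min_probe_length - max_mismatches,
                                 max_mismatches + 1)
    if remainder != 0:
        quotient += 1  # increment by one when the division is not exact
    return quotient


print(seed_size(21, 2))  # (21 - 2) / (2 + 1) = 6 remainder 1, so 7
```

Allowing more mismatches gives a shorter seed, and a shorter seed matches at more places, which fits the earlier remark that computation takes longer as the number of mismatches is increased.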

The "k_diff" module, FIG. 8, also calls the "assembly" module 260, writes the
results to a file 261a, plots the results 261b (discussed below), calculates the hairpin 
characteristics 262 (i.e., the number of base pairs and the length of the worst hairpin) 
and the melting temperature (Tm) for each candidate probe 263, and saves the results 
to a file 264. 

The screen graphs are plotted 261b by converting the result values to pixels, filling
a pixel array and performing a binary search into the pixel array. Next, given the 
number of pixels per probe position and which function is of interest to the user (i.e., 
the three mismatch match numbers), the program interpolates the values at the value 
of (pixelsPerPositionN-1) and computes the array of pixel values for drawing the graph. 
These values are then plotted on the MPSD. 

The "hashing" module, FIG. 9, performs hashing of the query. In other words, 
it creates the hash table and linked list of query positions with the same hash. The 
variable hash_table[i] equals the position of the first occurrence of hash i in the query.
If i does not appear in the query, hash_table[i] is set to zero.

The "tran" module, FIG. 10, is called by the "hashing" module 271, and performs
the hashing of the sequence of k_tuple (kernel or seed) size. If the k_tuple exists (i.e.,
its length is greater than zero), the variable uns is set equal to uns*ALF+p 291. The
variable p represents the digit returned by the "let_dig" module, FIG. 11, that represents
the nucleotide being examined. ALF is a constant that is set by the program in this
implementation to equal four. The query pointer is then incremented, while the size of
k_tuple (the seed) is decremented 292. This process is repeated until the sequence of
k_tuple has been entirely hashed. Then the "tran" module returns the variable
current_hash 293 to the "hashing" module FIG. 9.

The "let_dig" module, FIG. 11, is called by the "tran" module 291, and transforms 
the nucleotides represented as the characters "A", "T", "U", "G" and "C" in the GenBank
and the user's query into numeric digits for easier processing by the program. This
module transforms "a" and "A" into "0" 301, "t", "T", "u" and "U" into "1" 302, "g" and "G"
into "2" 303, and "c" and "C" into "3" 304. If the character to be transformed does not
match any one of those listed above, the module returns "-1" 305. The "hashing" 
module, FIG. 9, then calls the "update" module 272, FIG. 12, which updates the hash 
with a sliding window (i.e., it forms a new hash after shifting the old hash by "1"). The 
remainder of old_hash divided by powerj is calculated 311 (a modulus operation), the 
remainder is multiplied by ALF 312 (i.e., four), and then the digit representing the 
nucleotide is added to the result 313. The "update" module then returns the result 314 
to the "hashing" module FIG. 9. 
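
The three modules just described can be sketched together in Python (not the language of the preferred embodiment). The digit mapping, the constant ALF and the uns*ALF+p recurrence follow the text. The function signatures are assumptions, as is the reading of the divisor in the "update" module as ALF raised to the power (seed size - 1), which is the value that makes the sliding window drop the leading nucleotide. The sketch also assumes every character is a valid nucleotide.

```python
ALF = 4  # alphabet size, as stated in the text


def let_dig(ch: str) -> int:
    """Map a nucleotide letter to a digit; return -1 if it is not recognised."""
    return {"A": 0, "T": 1, "U": 1, "G": 2, "C": 3}.get(ch.upper(), -1)


def tran(seq: str, start: int, k_tuple: int) -> int:
    """Hash the k_tuple letters beginning at `start` as a base-ALF number."""
    uns = 0
    for ch in seq[start:start + k_tuple]:
        uns = uns * ALF + let_dig(ch)
    return uns


def update(old_hash: int, next_char: str, k_tuple: int) -> int:
    """Slide the window one position to the right and return the new hash."""
    power = ALF ** (k_tuple - 1)        # assumed value of the divisor
    remainder = old_hash % power        # drop the leading digit
    return remainder * ALF + let_dig(next_char)


first = tran("ACGTA", 0, 4)             # hash of "ACGT"
second = update(first, "A", 4)          # hash of "CGTA" without rehashing
print(first, second, second == tran("ACGTA", 1, 4))
```

The sliding update costs a constant amount of work per position, whereas recomputing the hash from scratch costs work proportional to the seed size.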

If the current hash has already occurred in the query, the program searches for 
the end of the linked list for the current hash 273 and marks the end of the linked list 
for the current hash 274. If the current hash has not already occurred in the query, the 
program puts the hash into the hash table 275. The resulting hash table and linked list 
are then returned to the "k_diff" module, FIG. 8 at 258. 
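
A minimal Python sketch of the resulting data structure is given below. It assumes 1-based query positions, so that zero can mark both "hash not present" in the hash table and "end of list" in the linked list, consistent with the description of hash_table[i] above. The names `build_query_index`, `link` and `positions` are hypothetical, and the linked list is represented as an array of next-position values rather than as C pointers.

```python
ALF = 4
DIGIT = {"A": 0, "T": 1, "U": 1, "G": 2, "C": 3}


def build_query_index(query: str, k_tuple: int):
    """Return (hash_table, link) covering every k_tuple of the query.

    hash_table[h] is the 1-based position of the first k_tuple with hash h,
    or 0 if h never occurs. link[p] is the next position whose k_tuple has
    the same hash as the one at position p, or 0 at the end of the chain.
    """
    hash_table = [0] * (ALF ** k_tuple)
    link = [0] * (len(query) + 1)
    for pos in range(1, len(query) - k_tuple + 2):
        h = 0
        for ch in query[pos - 1:pos - 1 + k_tuple]:
            h = h * ALF + DIGIT[ch.upper()]
        if hash_table[h] == 0:
            hash_table[h] = pos          # first occurrence of this hash
        else:
            tail = hash_table[h]
            while link[tail] != 0:       # search for the end of the chain
                tail = link[tail]
            link[tail] = pos             # append and leave link[pos] == 0
    return hash_table, link


def positions(hash_table, link, h):
    """Walk the chain for hash h and collect its query positions."""
    out = []
    p = hash_table[h]
    while p != 0:
        out.append(p)
        p = link[p]
    return out


table, link = build_query_index("ACACAC", 2)
print(positions(table, link, 3))   # chain for "AC" (hash 0*4 + 3)
print(positions(table, link, 12))  # chain for "CA" (hash 3*4 + 0)
```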

The "assembly" module, FIG. 13, extracts sequences from the GenBank and 
performs hit locating and extending functions. This module is called by the "k_diff" 
module FIG. 8 at 260 if the user has chosen to use the database to locate matches. The 
output from the "assembly" module (FIG. 13) tells the user that the section of the 
database searched contains E number of entries 321 of S summary length 322 with H 
number of hits 323. Further, the program tells the user that the number of considered 
1-tuples equals T 324. The entry head line is also printed 326.

The "seqload" module, FIG. 14, is called by the "k_diff" module FIG. 8 at 259
once the query hash table and linked list have been formed by the "hashing" module
FIG. 9. The "seqload" module FIG. 14 checks to see if the end of the GenBank file has
been reached 327, and, if not, searches until a record is found with LOCUS in the
head-line 328. Next, the LOCUS name is extracted 329 for identification purposes, and the
program searches for the ORIGIN field in the record 330.

The program then extracts the current sequence 331 from the GenBank and
performs two passes on each sequence. The first is to determine the sequence length
332 and allocate memory for each sequence 333, and the second pass is to read the
sequence into the allocated memory 334. Since the sequences being extracted can
contain either DNA nucleotides or RNA nucleotides, the "seqload" module can
recognize the characters "A", "T", "U", "G", and "C". The bases "A", "T", "G" and "C" are
used in DNA sequences, while the bases "A", "U", "G" and "C" are used in RNA and
mRNA sequences. The extracted sequence is then positioned according to the type of
nucleotides contained in the sequence 335, and the process is repeated. Once the end
of the sequence has been reached, the "seqload" module returns the sequence length 336
to the "k_diff" module FIG. 8.
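
The record scanning performed by the "seqload" module can be illustrated with a short Python sketch. It assumes the usual GenBank flat-file layout, in which a record begins with a LOCUS line, the sequence follows the ORIGIN line, and the record ends with a line containing "//". The function name `read_records` and the sample record are hypothetical. Because Python strings grow dynamically, the sketch needs only one pass, whereas the module described above uses two passes in order to allocate memory first.

```python
def read_records(lines):
    """Yield (locus_name, sequence) for each GenBank-style record in `lines`."""
    locus, in_origin, chunks = None, False, []
    for line in lines:
        if line.startswith("LOCUS"):
            # The LOCUS name is the second whitespace-separated field.
            locus, in_origin, chunks = line.split()[1], False, []
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            if locus is not None:
                yield locus, "".join(chunks).upper()
            locus, in_origin, chunks = None, False, []
        elif in_origin:
            # Sequence lines carry a position number and blanks; keep only
            # the nucleotide letters the module is said to recognise.
            chunks.append("".join(ch for ch in line if ch.upper() in "ATUGC"))


# Made-up record, for illustration only.
sample = [
    "LOCUS       TESTSEQ   12 bp    mRNA",
    "DEFINITION  Made-up record for illustration.",
    "ORIGIN",
    "        1 acgtacgtac gu",
    "//",
]
records = list(read_records(sample))
print(records)
```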

If the user has chosen to use one or more files to locate matches, rather than the 
database, the "read1" module, FIG. 15, rather than the "seqload" module FIG. 14, is
called by the "k_diff" module FIG. 8. The "read1" module, FIG. 15, reads the sequence
from the user specified query file 341 and allocates memory 342. This module also
determines the query length 343, extracts sequence identification information 344,
determines the sequence length 345, transforms each nucleotide into a digit 346 by
calling the "let_dig" module FIG. 11, creates the query hash table 347 by calling the
"dig_let" module FIG. 16, and closes the file 348 once everything has been read in.

First, the "read1" module FIG. 15 allocates space for the query 342. To do this,
the "ckalloc" module, FIG. 15 at 342, is called. This module allocates space and checks
whether this allocation is successful (i.e., is there enough memory or has the program
run out of memory). After allocating space, the "read1" module FIG. 15 opens the
user-specified file 349 (the "ckopen" module, FIG. 15 at 349, is called to ensure that the
query file can be successfully opened 349), determines the query length 343, locates a
record with LOCUS in the head-line and extracts the LOCUS name 344 for
identification purposes, locates the ORIGIN field in the record and then reads the query
sequence from the file 341. Next, the sequence length is determined 345, memory is
allocated for the sequence 342, and the sequence is read into the query file 350. If the
string has previously been found, processing is returned to 344. If not, then each
character in the query file is read into memory 350.

The characters are transformed into digits 346 using the "let_dig" module, FIG.
11, until a valid digit has been found, and then the hash table containing the query is
set up 347 using the module "dig_let", FIG. 16, which transforms the digits into
nucleotides represented by the characters "A" 371, "T" 372, "G" 373, "C" 374, and "X" 375
as a default. If the end of the file has not been reached, processing is returned to 344.
If it has, the file is closed 348 and the query is then returned to the "read1" module FIG.
15 at 347.
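
The digit-to-letter transformation is the inverse of the mapping performed by the "let_dig" module, and can be sketched in Python as below. The mapping and the "X" default follow the text. The helper `decode_hash`, which turns a base-4 seed hash back into its letters, is a hypothetical addition included only to show the two mappings working together; the text does not say that the program contains such a routine.

```python
def dig_let(digit: int) -> str:
    """Map a digit back to a nucleotide letter, with "X" as the default."""
    return {0: "A", 1: "T", 2: "G", 3: "C"}.get(digit, "X")


def decode_hash(h: int, k_tuple: int) -> str:
    """Hypothetical helper: recover the k_tuple letters from a base-4 hash."""
    letters = []
    for _ in range(k_tuple):
        h, digit = divmod(h, 4)      # peel off the last base-4 digit
        letters.append(dig_let(digit))
    return "".join(reversed(letters))


print(dig_let(2), dig_let(7), decode_hash(57, 4))
```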

The "q_colour" module, FIG. 17 (FIG. 13 at 325), is called by the "assembly"
module FIG. 13 after the current sequence has been extracted from the GenBank. The
"q_colour" module FIG. 17 performs the heart of the Mismatch Model process in that
it performs the comparison between the query and the database or file sequences. If
the module finds that there exists a long (i.e., greater than the min_hit_length) extended
hit, it returns a "1" to the "assembly" module FIG. 13. Otherwise, the "q_colour" module,
FIG. 17, returns a "0".

In the "q_colour" module, FIG. 17, all DNA positions are analyzed in the
following manner. First, the entire DNA sequence is analyzed 391 to see whether each
position is equal to zero 392 (i.e., whether it is empty or the sequence is finished). If
it is not equal to zero 393, the "q_colour" module FIG. 17 calls the "tran" module, FIG.
10 described above, which performs the hashing of k_tuples. The "tran" module FIG.
10 calls other modules which transform the nucleotides represented by characters into
digits for easier processing by the program and then updates the hash with a sliding
window. If the position is equal to zero, the current_hash position is set to new_hash
after one shift of old_hash 390 by calling the "update" module FIG. 12.
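
The sliding-window hashing of k_tuples can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the "tran" or "update" modules themselves: the base-4 encoding, the function names, and the use of a dictionary of position lists in place of the hash table and linked list are all hypothetical, and the input is assumed to contain only A, T, G and C.

```python
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}  # assumed digit encoding


def rolling_hashes(seq, k):
    """Yield (start, hash) for every k_tuple of seq using a sliding window."""
    top = 4 ** (k - 1)
    h = 0
    for i, ch in enumerate(seq):
        if i >= k:
            h -= CODE[seq[i - k]] * top  # drop the base leaving the window
        h = h * 4 + CODE[ch]             # shift and append the new base
        if i >= k - 1:
            yield i - k + 1, h


def build_query_table(query, k):
    """Map each k_tuple hash of the query to the positions where it occurs."""
    table = {}
    for pos, h in rolling_hashes(query, k):
        table.setdefault(h, []).append(pos)
    return table


def seeds(query, dna, k):
    """Return (query_pos, dna_pos) for every exact k_tuple match."""
    table = build_query_table(query, k)
    return [(qpos, dpos)
            for dpos, h in rolling_hashes(dna, k)
            for qpos in table.get(h, [])]


print(seeds("ATGCA", "GGATGC", 3))
```

Each window update costs a constant amount of work regardless of k, which is the usual reason for preferring a sliding update over rehashing every k_tuple from scratch.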

If the nucleotide at the current_hash position is equal to zero, processing is 
returned to 391. If not, the query position is set equal to (nucleotide at current hash 
position - 1). Next, the "q_colour" module FIG. 17 looks for the current_hash in the
hash table 394. If the current k_tuple does not match the query 395, then the next
k_tuple is considered 395, and processing is returned to 391. If the current k_tuple does
match the query, then the program checks the hit's (i.e., the match's) vicinity 396 by
calling the "hit_ext" module, FIG. 18, to determine if the hit is weak. The inventors have
found that if the code for the module "hit_ext" is included within the module "q_colour",
rather than being a separate module utilizing the parameter transfer machinery, 25% of
CPU time can be saved.

The "hit_ext" module FIG. 18 determines the current query position in the hit's
vicinity 421, determines the current DNA position in the hit's vicinity 422, and creates
the list of mismatch positions (i.e., the mismatch_location_ahead 423, the
mismatch_location_behind 423 and the kernel match location). If the hit is weak 424,
the "hit_ext" module FIG. 18 returns "0" to the "q_colour" module FIG. 17. If the hit
has a chance to contain 425, the module returns "1" to the "q_colour" module FIG. 17.
A hit has a chance to contain, and is therefore not considered weak, if the
mismatch_location_ahead minus the mismatch_location_behind is greater than the
min_hit_length. If not, it is a short hit and is too weak.
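
One way to picture the extension of a hit around its kernel match is the Python sketch below. It is an illustration under stated assumptions rather than the "hit_ext" code: the helper names, the explicit max_mismatches parameter, and the way the mismatch budget is split between the two directions are hypothetical, and the "greater than or equal" comparison follows the "colour" module description.

```python
def extents(pairs, max_mismatches):
    """extents[i] = number of bases reachable while crossing at most i mismatches."""
    out = []
    n = 0
    for a, b in pairs:
        if a != b:
            out.append(n)                 # extension stops just before this mismatch
            if len(out) > max_mismatches:
                return out
        n += 1
    while len(out) <= max_mismatches:     # ran off the end of a sequence
        out.append(n)
    return out


def extended_hit_length(query, dna, qpos, dpos, k, max_mismatches):
    """Longest stretch containing the k-base kernel with at most max_mismatches."""
    ahead = extents(zip(query[qpos + k:], dna[dpos + k:]), max_mismatches)
    behind = extents(zip(reversed(query[:qpos]), reversed(dna[:dpos])),
                     max_mismatches)
    # Spend i mismatches behind the kernel and the rest ahead of it.
    return max(behind[i] + k + ahead[max_mismatches - i]
               for i in range(max_mismatches + 1))


def hit_ext(query, dna, qpos, dpos, k, max_mismatches, min_hit_length):
    """Return 1 if the extended hit is long enough, otherwise 0 (a weak hit)."""
    length = extended_hit_length(query, dna, qpos, dpos, k, max_mismatches)
    return 1 if length >= min_hit_length else 0


print(extended_hit_length("GGATGCATT", "GGATGCTTT", 2, 2, 3, 1))
```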

If the "hit_ext" module FIG. 18 tells the "q_colour" module FIG. 17 that the hit
was not a weak one, then the "q_colour" module determines whether the current hit is
long enough 398 by calling the "colour" module FIG. 19. The "colour" module FIG. 19
performs query_colour modification by the hit data, starting at pos_query and described
by mismatch_location_ahead and mismatch_location_behind. After the variables to be
used in this module are defined, variable isw_print (which is the switch indicating the
hit length) is initialized to zero 430. The cur_length is then set equal to the length of
the extending hit 431 (mismatch_location_behind[i] + mismatch_location_ahead[j] - 1).
Next, if cur_length is greater than or equal to the min_hit_length 432 (i.e., the minimum
considered probe size), the hit is considered long and isw_print is set equal to two 433.
The value of isw_print is then returned 434 to the "q_colour" module FIG. 17.

If the length of the extending hit is longer than the min_hit_length, the hit is 
considered long 399. Otherwise, the hit is considered short. If the hit is short, nothing 
more is done to the current hit and the module begins again. If, on the other hand, the 
hit is considered long 399, the "q_colour" module FIG. 17 prints the current extended 
hit 400. The current extended hit can be printed in ASCII, printed in a binary file, or 
printed to a memory file. The "q_colour" module FIG. 17 then repeats until the end of
the linked list is reached.

d. Outputs

The output of the k_diff program in the current implementation of this invention 
may be either a binary file containing the number of extended hits and the k_mismatch 
hit locations (see FIG. 20), or the output may be kept in memory without writing it to 
a file. See Section 1(d)(iv) for more detail.

3. Description of the H-Site Model Program

a. Overview 

In this invention, the second hybridization strength model is termed the H-Site 
Model (see FIG. 2 for user selection of this model). One aspect of the H-Site Model 
uses a generalization of an experimental formula in general usage. The formula used 
in the H-Site Model is an expression of the fact that melting temperature Tm is a 
function of both probe length and percent of GC content. This basic formula has been 
modified in this invention to account for the presence of mismatches. Each percent of 
mismatch reduces the melting temperature Tm by an average of 1.25 degrees (2 degrees 
C for an AT mismatch, and 4 degrees C for a GC mismatch). 
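
The average mismatch correction stated above can be written out as a one-line calculation. The Python sketch below assumes the melting temperature of the perfectly matched probe is already known; the function and parameter names are hypothetical, and only the average figure of 1.25 degrees per percent of mismatch is taken from the text.

```python
def tm_with_mismatches(tm_perfect, n_mismatches, probe_length,
                       drop_per_percent=1.25):
    """Reduce a perfect-match Tm by 1.25 degrees for each percent of mismatch."""
    percent_mismatch = 100.0 * n_mismatches / probe_length
    return tm_perfect - drop_per_percent * percent_mismatch


# One mismatch in a 20-base probe is 5 percent mismatch.
print(tm_with_mismatches(60.0, 1, 20))
```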

In addition, this implementation of the invention does some preliminary 
preprocessing of the GenBank database to sort out and select the cDNA sequences. 
This is done by locating a keyword (in this case CDS) in each GenBank record. No 
other programs currently available allow for this combination of functions as far as the 
inventors are aware. 

There are a number of modules in the present embodiment of the H-Site Model 
contained in this invention. Each step of the processing involved in the H-Site Model 
is more fully explained below, and is accompanied by detailed flow charts. 

b. Inputs 

There are two basic user-selected inputs for the H-Site Model (see FIG. 2C): 1) 
the melting temperature Tm 22 for which probes are being designed (i.e., the melting 
temperature that corresponds to a particular experiment or condition the user desires 
to simulate); and 2) the nucleation threshold 23, which is the number of base pairs 
constituting a nucleation site. The user is also required to select the 1) target species 
11 gene sequence(s) (DNA, mRNA or cDNA) for which probes are being designed; 2) 
the preparation 12 of all sequences with which hybridizations are to be calculated; and 
3) the probe output file 13. The preparation file is the most important, as discussed 
below. 

c. Organization of the H-Site Model Program 

The current implementation of the H-Site Model program of this invention is 
distributed between five files containing numerous modules. The main file is designated 
by the inventors as "ds.cpp" in its uncompiled version. This file provides overall control
to the entire invention. It is divided into six sections. Section 0 defines and manipulates 
global variables. Section 1 controls general variable definition and initialization
(including the arrays and memory blocks). It also reads and writes buffers for user input
selections, and constructs multi buffers. 

Section 2 sets up and initializes various "snippet" variables (see section below for 
a complete definition of the term snippet), converts base pair characters to a 
representation that is 96 base pairs long and to ASCII base pair strings, and performs 
other sequence file manipulation such as comparing snippets. This section also reads 
the sequence format file, reads base pairs, checks for and extracts sequence 
identification information (such as ORIGIN and LOCUS) and filters out sequences 
beginning with numbers. 

Section 3 involves preparation file manipulation. This section performs the 
preprocessing on the PRP file discussed above. It also merges and sorts the snippet 
files, creates a PRP file and sorts it, and outputs the sorted snippets. Next, this section 
streams through the PRP file. 

Section 4 contains the essential code for H-Site Model processing (see FIGS. 21 
through 23 for details, discussed below). Streams are set up, and then RIBI 
comparisons are performed for hybridizations (see file "ribi.cpp" for definitions of RIBI 
search techniques). Next, probes are generated, binding strength is converted to melting 
temperature, and hybridizations are calculated and stored (including hybridization 
strength). Lastly, other H-Site calculations are performed. 

Section 5 is concerned with formatting and presenting diagnostic and user file 
(test.out, test1.out, and test2.out files) output. This section also handles the graphing
functions (the MPSD diagram in particular). In addition, this section calculates the 
hairpin characteristics for the H-Site Model candidate probes. 

The second H-Site Model file, designated as "ds.h" defines data variables and 
structures. Section 1 of this file concerns generic data structures (including memory 
blocks and arrays, and file inputs and outputs). Section 2 defines the variables and 
structures used with sequences, probes and hybridizations. Section 3 defines variables 
and structures concerned with protocols (i.e., function prototypes, graphing, etc.). 

The third H-Site Model file, designated as "funcdoc.txt", contains very detailed 
documentation for this implementation of the H-Site Model program. Numerous 
variables and structures are also defined. The flow of the program is clearly shown in 
this file. 

The fourth H-Site Model file, designated as "ribi.h" handles the sequence 
comparisons. The fifth and last H-Site Model file, designated as "ribi.cpp", performs
internal B-Tree indexing. Definitions of Red-black Internal Binary Index (RIBI)
searching are found in this file. Definitions are also included for the concepts keyed set,
index, binary tree, internal binary index, paths, and red-black trees. Implementation
notes are also included in this file.

d. Processing

Implementation of the H-Site Model in this invention is done in three stages. 
First, the invention creates the preparation (PRP) file, which contains all relevant 
information from the sequence database. This is the preprocessing stage discussed 
above. Next, the target is prepared by the program. Lastly, the invention calculates the 
MPSD data using the PRP file and target sequence to find probes. 

i. Creation of the Preprocessed Preparation File 

FIG. 21. Step 1: The program first opens the sequence database for reading into 
memory 461,462. Step 2: Next, as sequence base pairs are read in 462, "snippets" are 
saved to disk 463, along with loci information. A snippet is a fixed-length subsequence 
of a preparation sequence. The purpose of snippets is to allow the user to examine a 
small portion of a preparation sequence together with its surrounding base pairs. 
Snippets in the implementation of this invention are 96 base pairs long (except for 
snippets near the end or beginning of a sequence, which may have fewer base pairs). 
The "origin" of the snippet is in position 40. For snippets taken near the beginning of
a sequence, some of the initial 40 bases are undefined. For snippets near the end of a 
sequence, some of the final 55 bases are undefined. Snippets are arranged in the 
preparation file (PRP) in sorted order (lexicographical order beginning at position 40). 
In this invention, the term "lexicographical order" means a preselected order, such as 
alphabetical, numeric or alphanumeric. In order to conserve space, snippets are only 
taken at every 4th position of the preparation sequence. 
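
The snippet layout described above can be sketched in Python as follows. This is an illustration only: position 40 is treated as a zero-based index (so that 40 bases precede the origin and 55 follow it), undefined bases are shown with a "." placeholder, and the function names are hypothetical.

```python
SNIPPET_LEN = 96
ORIGIN = 40   # assumed zero-based index of the snippet's origin base
STEP = 4      # a snippet is taken at every 4th position
PAD = "."     # assumed placeholder for undefined bases


def make_snippets(seq, locus):
    """Return (snippet_text, locus, position) for every 4th position of seq."""
    snippets = []
    for pos in range(0, len(seq), STEP):
        start = pos - ORIGIN
        end = start + SNIPPET_LEN
        left_pad = PAD * max(0, -start)              # near the beginning
        right_pad = PAD * max(0, end - len(seq))     # near the end
        body = seq[max(0, start):min(len(seq), end)]
        snippets.append((left_pad + body + right_pad, locus, pos))
    return snippets


def sort_snippets(snippets):
    """Sort snippets lexicographically, comparing from the origin onward."""
    return sorted(snippets, key=lambda s: s[0][ORIGIN:])


example = make_snippets("ACGT" * 30, "EXAMPLE1")
print(len(example), example[0][0])
```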

Step 3: The snippets are merge sorted 464 to be able to search quickly for 
sequences which pass the "screen", discussed below. Step 4: The merged file is 
prepended with identifiers for the sources of the snippets 465. This is done to identify 
the loci from which hybridizations arise. 

ii. Target Preparation 

FIG. 22. Step 1: The target sequence file is opened 471 and read into memory 
472. For each position in the target mRNA, the probe defined at that starting position 
is the shortest subsequence starting at that position whose hybridization strength is 
greater than the user specified melting temperature Tm. Typically, the probes are of
by one base pair 475 to correspond to the fact that snippets are only taken at every four
base pairs. A screen is a subsequence of the target mRNA of length equal to the 
screening threshold specified by the user. The screens are then indexed 476 and sorted 
in memory 477. 

iii. Calculation of the MPSD Data 
FIG. 23. Step 3: This step is the heart of the process. Step 3a: The program 
streams through the following five items in sync, examining them in sequential order: 
the snippet file and the four lists of screens 481-484. Step 3b: Each snippet is 
compared to a screen 485. Step 3c: If the snippet does not match, whichever stream 
is behind is advanced 486 and Step 3b is repeated. If the snippet does match, Step 4 
is performed. 
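
The synchronized streaming of Steps 3a through 3c resembles a merge over sorted lists. The Python sketch below shows the idea for a single list of screens rather than four. It assumes that the snippet keys and the screens are already sorted, that all screens have the same length and are distinct, and that a snippet matches a screen when the snippet key begins with it; the function name is hypothetical.

```python
def match_stream(sorted_keys, sorted_screens):
    """Return (key_index, screen_index) pairs found by one pass over both lists."""
    matches = []
    i = j = 0
    while i < len(sorted_keys) and j < len(sorted_screens):
        screen = sorted_screens[j]
        prefix = sorted_keys[i][:len(screen)]
        if prefix == screen:
            matches.append((i, j))
            i += 1        # later snippets may match the same screen
        elif prefix < screen:
            i += 1        # the snippet stream is behind, so advance it
        else:
            j += 1        # the screen stream is behind, so advance it
    return matches


keys = sorted(["AAGT", "ACGT", "ACTT", "GGCA", "TTAA"])
screens = sorted(["AC", "GG"])
print(match_stream(keys, screens))
```

Because both inputs are sorted, each element is visited once, so the pass takes time proportional to the combined length of the two lists.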

Step 4: If a snippet and a matching screen were found in Step 3b 487, the 
hybridization strength of the binding between the sequence containing the snippet and 
all of the probes containing the screen is calculated (see Step 5). Double counting is 
avoided by doing this only for the first matched screen containing the probe. Each pair 
of bases is examined and assigned a numerical binding strength. An AT pair would be 
assigned a lower binding strength than a GC pair because AT pairs have a lower 
melting temperature Tm. The process is explained more fully below at Step 5b. 

Step 5: The hybridization strengths between sequence and all the probes 
containing it are calculated using a dynamic programming process. The process is as 
follows: Step 5a: Begin at the position of the first probe containing the given screen 
but not containing any other screens which start at an earlier position and also match 
the sequence. This is done to avoid double counting. Two running totals are 
maintained: a) boundStrength, which represents the hybridization strength contribution 
which would result if the sequence and probe were to match exactly for all base pairs 
to the right of the current position, and b) unboundStrength, which represents the 
strength of the maximally binding region. Step 5b: At each new base pair, the variable 
boundStrength is incremented by 71 if the sequence and probe match and the matched 
base pair is GC 489, incremented by 30 if the matched base pair is AT 490 (i.e., this 
number is about 42.25% of the first number 71), and decremented by 74.5 if there is not 
a match 488 (i.e., this number is about 5% larger than the first number 71). Step 5c: 
If the current boundStrength exceeds the current unboundStrength 491 (which was 
originally initialized to zero), a new binding region has been found, and
unboundStrength is set equal to boundStrength 492. Step 5d: If the current
boundStrength is negative, boundStrength is reset to zero 493. Step 5e: If the current 
position is at the end of a probe, the results (the hybridization strengths) are tallied for 
that probe. Step 5f: If the current position is at the end of the last probe containing 
the screen, the process stops. 
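
Steps 5b through 5d amount to a running-maximum calculation over the aligned bases. The Python sketch below follows the increments given in the text (71, 30 and 74.5) for a single probe aligned against a single sequence. The function name is hypothetical, a "match" is taken to mean identical characters at the same aligned position, and the bookkeeping across several probes (Steps 5a, 5e and 5f) is omitted.

```python
GC_MATCH = 71      # increment for a matched G or C
AT_MATCH = 30      # increment for a matched A or T
MISMATCH = 74.5    # decrement for a mismatch


def max_binding_strength(probe, sequence):
    """Return the strength of the maximally binding region of the alignment."""
    bound = 0.0      # running total, boundStrength in the text
    unbound = 0.0    # best total seen so far, unboundStrength in the text
    for p, s in zip(probe, sequence):
        if p != s:
            bound -= MISMATCH
        elif p in "GC":
            bound += GC_MATCH
        else:
            bound += AT_MATCH
        if bound > unbound:
            unbound = bound    # Step 5c: a new binding region has been found
        if bound < 0:
            bound = 0.0        # Step 5d: reset a negative running total
    return unbound


print(max_binding_strength("GCATGC", "GCTTGC"))
```

Resetting the running total when it goes negative means a badly mismatched stretch cannot drag down a strong region that follows it.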

Step 6: A tally is kept of the number and melting temperature of the matches 
for each candidate probe, and the location of the best 20 candidates, using a priority 
queue (reverse order by hybridization strength number) 494. Step 7: A numerical
"score" is kept for each preparation sequence by tallying the quantity exp (which can be
expressed as Σe^(-Tm)) for each match 495, where Tm is the melting temperature for the
"perfect" match, the probe itself. In other words, the probe hybridizes "perfectly" to its 
target. 
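
Keeping only a fixed number of best candidates with a priority queue can be sketched in Python with the standard heapq module. The sketch assumes, purely for illustration, that a lower hybridization strength number marks a better candidate; the text does not spell out the ranking criterion, and the function name and the (strength, position) tuples are hypothetical.

```python
import heapq


def best_candidates(scored, n=20):
    """Keep the n candidates with the lowest strength from (strength, position) pairs."""
    heap = []  # holds (-strength, position); heap[0] is the worst candidate kept
    for strength, position in scored:
        item = (-strength, position)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the current worst candidate
    return sorted((-neg, pos) for neg, pos in heap)


scored = [(5, 0), (3, 1), (9, 2), (1, 3), (7, 4)]
print(best_candidates(scored, n=3))
```

Storing the candidates in reverse order keeps the worst retained candidate at the top of the queue, so each new candidate is compared against it in constant time.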

Step 8: Hairpins are calculated by first calculating the complementary probe. 
In other words, the order of the bases in the candidate probe is reversed (CTATAG
to GATATC), and complementary base pairs are substituted (A for T, T for A, G for 
C, and C for G, changing GATATC to CTATAG in the above example). Next, the 
variable representing the maximum hairpin length for a candidate probe is initialized to 
zero, as is the variable representing a hairpin's distance. For each offset, the original 
candidate probe and the complementary probe just created are then aligned with each 
other and compared. The longest match is then found. If any two matches have the 
same length, the one with the longest hairpin distance (i.e., the number of base pairs 
separating the match) is then saved. 
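
The hairpin search of Step 8 can be sketched in Python as follows. The sketch builds the complementary probe, aligns it against the original probe at every offset, and keeps the longest run of matching bases, preferring the larger hairpin distance when lengths tie. Two points are assumptions made for illustration: the two arms of a hairpin are required not to overlap, and the hairpin distance is counted as the number of bases lying between the arms. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(probe):
    """Reverse the base order and substitute complementary bases."""
    return "".join(COMPLEMENT[b] for b in reversed(probe))


def longest_hairpin(probe):
    """Return (length, distance) of the longest self-complementary match."""
    rc = reverse_complement(probe)
    n = len(probe)
    best_len, best_dist = 0, 0
    for offset in range(-(n - 1), n):      # rc[j] is aligned with probe[j + offset]
        run = 0
        for j in range(n):
            i = j + offset
            # rc[j] stands for probe[n - 1 - j]; requiring i < n - 1 - j keeps
            # the two arms of the hairpin from overlapping.
            if 0 <= i < n - 1 - j and probe[i] == rc[j]:
                run += 1
                dist = n - 2 - i - j       # bases between the two arms
                if (run, dist) > (best_len, best_dist):
                    best_len, best_dist = run, dist
            else:
                run = 0
    return best_len, best_dist


print(reverse_complement("CTATAG"), longest_hairpin("GGGGAAAACCCC"))
```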

Step 9: The preparation sequences are then sorted 496 and displayed in rank 
order, from best to worst 497. Step 10: The resulting MPSD, which includes all 
candidate probes, is then displayed on the screen. Step 11: The best 20 matches are 
also printed or displayed in rank order, as the user requests 497. 

e. Outputs 

The outputs of the H-Site Model as currently implemented in this invention are 
fully described in Section 1(d)(iv), above, and illustrated in FIGS. 4 through 6. Samples
of the two output files created by the H-Site Model are shown in FIGS. 24A and 24B.

4. Description of the Mitsuhashi Probe Selection Diagram Processing 
Once the Mitsuhashi Probe Selection Diagram (MPSD) data has been calculated 
by the H-Site Model program (see stage three and FIG. 23, discussed above), it is
necessary to convert this data to pixel format and plot a graph. An overview of this
process is shown in FIG. 25. First, the program calculates the output (x,y) ranges 500. 
Next, these are converted to a logarithmic scale 501. The values are then interpolated 
502, and a bitmap is created 503. Lastly, the bitmap is displayed on the screen 504 in 
MPSD format (discussed above in section 1(e)(i)). A sample MPSD is shown in FIG.
4. 

5. Description of the MatchInfo Window Processing

The ProbeInfo and MatchInfo windows are discussed in great detail in Section
1(e)(ii), and a sample of these windows is shown in FIG. 5. An overview of the
processing involved in creating the MatchInfo portion of the window is given in the flow
chart in FIG. 26. First, as the user moves the MPSD cursor 520 (seen as a vertical line
bisecting the MPSD window), the program updates the position of the candidate probe
shown under that cursor position 521. Next, based upon the candidate probe's position,
the program updates the sequence 522 and hairpin information 523 for that probe. This
updated information is then displayed in an updated match list 524, shown in the
MatchInfo window.

The above described embodiments of the present invention are merely descriptive
of its principles and are not to be considered limiting. The scope of the present 
invention instead shall be determined from the scope of the following claims including 
their equivalents.

1. A programmed computer system for designing optimal oligonucleotide
sequences for use with a gene sequence data source comprising: 

first input means for introducing user-selected gene sequence into the 
computer system; 

memory means for storing user-selected gene sequence; 

means for accessing gene sequence data from said gene sequence data
source;

means for performing exact and inexact match modeling between gene
sequences;

means for performing hybridization strength modeling on gene sequences; 
means for selecting either of said modeling means; and 
means for presenting the results of said modeling to present candidate 
oligonucleotide sequences.

2. A programmed computer system in accordance with Claim 1 wherein said 
means for performing exact and inexact match modeling utilizes said accessing means 
to introduce a user-selected set of gene sequence data and a user-selected set of target
gene sequence data from said gene sequence data source into the computer system and 
said memory means to store said gene sequence data and said target gene sequence data 
and wherein said means for performing exact and inexact match modeling includes: 

means for determining a minimum sequence length; 

means for creating a look-up hash table and linked list in memory for each 
gene sequence in said gene sequence data and each of said target gene sequences; 

means for calculating the minimum length of any matching gene 
subsequence of said gene sequence data and said target gene sequence data; 

means for comparing each base pair character in each said target sequence 
stored in a hash table in memory to each base pair character of said gene sequence 
stored in a hash table in memory; 

means for finding a matching seed by determining if the said comparison 
results in a matching gene subsequence of length equal to said calculated minimum 
length;

means for comparing base pair characters behind and ahead of said seed
to determine if there exists an extended match of a subsequence of base pair characters 
of length greater than the calculated minimum length, resulting in a current hit 
sequence; 

means for calculating whether said current hit sequence is longer than said 
minimum sequence length, resulting in a current candidate oligonucleotide sequence; 

means for storing said current candidate oligonucleotide sequence; and 
wherein said presenting means provides said current candidate 
oligonucleotide sequence to the user. 

3. A programmed computer system in accordance with Claim 2 wherein said 
computer system includes: 

means for calculating the melting temperature for each candidate 
oligonucleotide sequence; 

means for tracking the number and melting temperature of the matches 
for each candidate oligonucleotide sequence; 

means for tracking the location of a set number of the best candidate 
oligonucleotide sequences; and 

wherein said presenting means is operative to present said additional 
results to the user; and 

wherein said presenting means provides said melting temperature to the
user.

4. A programmed computer system in accordance with Claim 2 wherein said 
computer system includes: 

means for determining the length of sequences from said target gene 
sequence data. 

5. A programmed computer system in accordance with Claim 2 wherein said
computer system includes: 

means for determining the length of sequences from said set of gene 
sequence data.

6. A programmed computer system in accordance with Claim 2 wherein said
computer system includes: 

means for copying the LOCUS name for each said gene sequence into said 
memory means; and 

means for linking said LOCUS name with each said gene sequence. 

7. A programmed computer system in accordance with Claim 2 wherein said 
means for performing exact and inexact match modeling utilizes said accessing means 
to introduce a user-selected minimum sequence length from said gene sequence data 
source into the computer system and said memory means to store said minimum 
sequence length. 

8. A programmed computer system in accordance with Claim 2 wherein said 
computer system includes: 

means for calculating the melting temperature for each candidate 
oligonucleotide sequence; 

means for tracking the number and melting temperature of the matches 
for each candidate oligonucleotide sequence; 

means for tracking the location of a set number of the best candidate 
oligonucleotide sequences employing a priority queue by sorting said candidate 
oligonucleotide sequences in reverse order and sorting said candidate oligonucleotide 
sequences by hybridization strength; 

wherein said presenting means is operative to present said additional 
results to the user; and 

wherein said presenting means provides said melting temperature to the
user.

9. A programmed computer system in accordance with Claim 2 wherein said 
first input means is operative to introduce a user-selected maximum number of
mismatches and a user-selected minimum candidate oligonucleotide sequence length into 
the computer system, and wherein said means for calculating the minimum length of any 
matching gene subsequence of said gene sequence data and said target gene sequence 
data comprises the steps of:

means for subtracting said maximum number of mismatches from said
minimum candidate oligonucleotide sequence length to give a first result; 

means for dividing said first result by said maximum number of mismatches 
plus one to give a second result; 

means for incrementing said second result by one if the remainder is not 
equal to zero to give a third result; and 

means for truncating said third result to an integer. 

10. A programmed computer system in accordance with Claim 9 wherein said 
means for calculating the hairpin characteristics of said candidate oligonucleotide 
sequence comprises the steps of: 

calculating a complementary sequence to the candidate oligonucleotide 
sequence by reversing the base pair order of the candidate oligonucleotide sequence and 
substituting complementary base pairs; 

comparing each character of said original candidate oligonucleotide 
sequence and said complementary sequence; 

finding the longest match between said original candidate oligonucleotide 
sequence and said complementary sequence; and 

saving the match with the longest hairpin distance if any two matches have 
the same length; 

means for storing hairpin characteristics; and 

wherein said presenting means provides said hairpin characteristics to the
user.

11. A programmed computer system in accordance with Claim 2 wherein said 
computer system includes a means for calculating the hairpin characteristics of said 
candidate oligonucleotide sequence.

12. A programmed computer system in accordance with Claim 2 wherein said 
means for preprocessing said set of target gene sequence data and said set of gene 
sequence data comprises the steps of: 

searching for sequences without introns in said target gene sequences and 
said gene sequences;

extracting target gene sequences and gene sequences that do not contain
introns; and

storing said extracted target gene sequences and gene sequences in
memory.

13. A programmed computer system in accordance with Claim 1 wherein said 
means for performing hybridization strength modeling utilizes said first input means to 
introduce a user-selected screening threshold into the computer system and said 
accessing means to introduce a user-selected set of gene sequence data and a user- 
selected set of target gene sequence data from said gene sequence data source into the 
computer system, and said memory means to store said gene sequence data, said target 
gene sequence data and said screening threshold and wherein said means for performing 
hybridization strength modeling comprises: 

means for preprocessing said target gene sequence data and said gene 
sequence data by selecting only those sequences without introns; 

means for forming a preparation file of gene sequence fragments by cutting 
said target gene sequences into fixed length target gene subsequences and sorting said 
subsequences in lexicographical order; 

means for merge sorting said gene sequences; 

means for forming multiple lists of screens by forming lists of subsequences 
of the preparation file of length equal to said screening threshold; 

means for indexing, sorting and storing said screens in said memory means; 

means for sequentially comparing said preparation file gene sequences with 
each of said screens to design candidate oligonucleotide sequences; 

means for calculating the hybridization strengths between a gene sequence 
and all candidate oligonucleotide sequences containing that gene sequence by accounting 
for Guanine-Cytosine (GC) and Adenine-Thymine (AT) base pair content of the gene 
sequence and the number of mismatches between said preparation file sequences and 
a said screen when said comparison results in a match; 

means for preparing the candidate oligonucleotide sequence and 
hybridization strength for presentation to the user; and 

wherein said presenting means provides the candidate oligonucleotide 
sequence and hybridization strength to the user.

14. A programmed computer system in accordance with Claim 13 wherein said
computer system includes: 

means for calculating the melting temperature for each candidate 
oligonucleotide sequence;

means for tracking the number and melting temperature of the matches 
for each candidate oligonucleotide sequence; 

means for tracking the location of a set number of the best candidate 
oligonucleotide sequences; 

means for preparing the melting temperature for presentation to the user;
and

wherein said presenting means provides the melting temperature to the
user.

15. A programmed computer system in accordance with Claim 14 wherein said 
means for calculating said candidate oligonucleotide sequence's melting temperature 
comprises: 

solving the formula Tm = 81.5 - 16.6(log[Na]) - .63 %(formamide) + ((.41
(%(G + C)) - 600)/N), wherein log[Na] is the sodium concentration, %(G + C) is the
fraction of matched base pairs which are G-C complementary, N is the sequence length 
and wherein the number of mismatches is equal to zero. 

16. A programmed computer system in accordance with Claim 15 wherein said 
computer system includes: 

means for reducing a candidate oligonucleotide probe's calculated melting 
temperature by a certain amount for each percent of mismatch between the candidate 
oligonucleotide sequence and a user-selected target gene sequence based upon the 
assumption that there are an equal number of GC and AT base pair mismatches. 

17. A programmed computer system in accordance with Claim 16 wherein said 
means for reducing a candidate oligonucleotide sequence's calculated melting 
temperature comprises the steps of: 

reducing said calculated melting temperature by 2 degrees Celsius if an AT 
mismatch exists; and

reducing said calculated melting temperature by 4 degrees Celsius if a GC
mismatch exists. 



18. A programmed computer system in accordance with Claim 13 wherein said 
computer system includes: 

means for assigning a numerical score to each said gene sequence; and 
means for sorting said gene sequences in accordance with said numerical
score.

19. A programmed computer system in accordance with Claim 13 wherein said 
means for performing hybridization strength modeling utilizes said accessing means for 
copying the LOCUS name for each said gene sequence into said memory means, and 
said memory means; and 

means for prepending said gene sequence with said LOCUS name. 

20. A programmed computer system in accordance with Claim 13 wherein four 
lists of screens are formed by said list forming means. 

21. A programmed computer system in accordance with Claim 13 wherein said 
computer system includes a means of shifting each screen by at least one base pair as 
it is formed by said list forming means. 

22. A programmed computer system in accordance with Claim 13 wherein said 
computer system includes: 

means for calculating the melting temperature for each candidate 
oligonucleotide sequence;

means for tracking the number and melting temperature of the matches 
for each candidate oligonucleotide sequence; 

means for tracking the location of a set number of the best candidate 
oligonucleotide sequences employing a priority queue by sorting said candidate 
oligonucleotide sequences in reverse order and sorting said candidate oligonucleotide 
sequences by hybridization strength; 

means for preparing the melting temperature for presentation to the user;
and

wherein said presenting means provides the melting temperature to the
user.
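
The priority-queue tracking recited in Claim 22 can be sketched as a bounded heap. This is only one possible reading: it assumes that a lower hybridization strength marks the better candidate (the claim does not say which direction is best), and it stores negated strengths so that the heap keeps its entries in reverse order. The name `track_best` and the (start, strength) pair layout are hypothetical.

```python
import heapq

def track_best(candidates, limit):
    """Return the `limit` best (strength, start) pairs, best first.

    `candidates` is an iterable of (start_position, strength) pairs.
    Assumption: a lower hybridization strength is the better candidate.
    """
    if limit <= 0:
        return []
    heap = []  # holds (-strength, start); heap[0] is the worst candidate kept
    for start, strength in candidates:
        entry = (-strength, start)
        if len(heap) < limit:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The newcomer beats the worst kept candidate, so swap it in.
            heapq.heapreplace(heap, entry)
    return sorted((-neg, start) for neg, start in heap)
```

Because the heap never holds more than `limit` entries, the memory needed stays fixed no matter how many candidate positions are scanned.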

23. A programmed computer system in accordance with Claim 13 wherein said 
computer system includes: 

means for assigning a numerical score to each said gene sequence by 
tallying the quantity "exp" where "exp" = Σ e^(-Tm) and wherein Tm is the melting
temperature for the said gene sequence; and 

means for sorting said gene sequences in accordance with said numerical
score.

24. A programmed computer system in accordance with Claim 13 wherein said 
means for calculating the hybridization strengths between a gene sequence and all 
candidate oligonucleotide sequences containing that gene sequence comprises the steps 
of: 

accessing gene sequence data from said gene sequence data source; 

comparing base pairs of a first gene sequence and a second gene sequence 
to determine if a match exists; 

incrementing said first gene sequence's bound strength by some first 
number if a base pair character in said first gene sequence and said second gene 
sequence match and the matched base pair is equal to a combination of the bases 
Guanine (G) and Cytosine (C); 

incrementing said first gene sequence's bound strength by some second 
number if a base pair character in said first gene sequence and said second gene 
sequence match and the matched base pair is equal to a combination of the bases 
Adenine (A) and Thymine (T); 

decrementing said first gene sequence's bound strength by a third number 
if there is no match in base pairs between said first gene sequence and said second gene 
sequence; 

comparing said first gene sequence's bound strength to said first gene 
sequence's unbound strength; 

setting said first gene sequence's unbound strength equal to its bound 
strength if said first gene sequence's bound strength is greater than said first gene 
sequence's unbound strength; and

resetting said first gene sequence's bound strength to zero if said first gene
sequence's unbound strength is less than zero. 

25. A programmed computer system in accordance with Claim 24 wherein said 
first and second numbers are greater than zero. 

26. A programmed computer system in accordance with Claim 24 wherein said 
second number is in the order of 42% of said first number. 

27. A programmed computer system in accordance with Claim 24 wherein said 
third number is in the order of 5% larger than said first number. 

28. A programmed computer system in accordance with Claim 13 wherein said 
computer system includes a means for calculating the hairpin characteristics of said 
candidate oligonucleotide sequence; 

means for preparing the hairpin characteristics for presentation to the user;
and

wherein said presenting means provides the hairpin characteristics to the
user.

29. A programmed computer system in accordance with Claim 28 wherein said 
means for calculating the hairpin characteristics of said candidate oligonucleotide 
sequence comprises the steps of: 

calculating a complementary sequence to the candidate oligonucleotide 
sequence by reversing the base pair order of the candidate oligonucleotide sequence and 
substituting complementary base pairs; 

comparing each character of said original candidate oligonucleotide 
sequence and said complementary sequence; 

finding the longest match between said original candidate oligonucleotide 
sequence and said complementary sequence; and 

saving the match with the longest hairpin distance if any two matches have 
the same length; 

means for preparing the hairpin characteristics for presentation to the user;
and

wherein said presenting means provides the hairpin characteristics to the
user.
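
The hairpin steps of Claim 29 can be sketched in Python as follows. The sketch assumes that "hairpin distance" means the number of unpaired bases in the loop between the two arms of the stem, and that the two arms may not overlap. The function names and the returned tuple layout are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Reverse the base order and substitute complementary bases."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def longest_hairpin(seq):
    """Return (stem_length, loop_length, start) of the longest hairpin stem.

    Ties in stem length are resolved in favor of the longer loop.
    """
    n = len(seq)
    rc = reverse_complement(seq)
    best = (0, 0, 0)
    for i in range(n):
        for j in range(n):
            k = 0
            # seq[i + k] pairs with seq[n - 1 - j - k]; stop before the
            # two arms of the stem would meet or cross.
            while i + k < n - 1 - j - k and seq[i + k] == rc[j + k]:
                k += 1
            if k:
                loop = n - i - j - 2 * k  # bases between the two arms
                if (k, loop) > best[:2]:
                    best = (k, loop, i)
    return best
```

For the sequence GGGAAAACCC the sketch reports a stem of three base pairs (GGG with CCC) around a four-base loop, starting at index 0. The brute-force scan is quadratic in the sequence length, which is acceptable for probe-sized sequences.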

30. A programmed computer system in accordance with Claim 13 wherein said 
fixed-length subsequences are calculated by a method comprising the steps of:

locating the origin of said subsequence in a set position of said target gene 
sequence in said preparation file; 

cutting a subsequence that is a fixed-length long every preselected number 
of positions of said target gene sequence in said preparation file; and 

sorting said subsequences in said preparation file in lexicographical order 
beginning at a set position. 

31. A programmed computer system in accordance with Claim 30 wherein the 
origin of said subsequence is located at position 40 of said target sequence in said 
preparation file. 

32. A programmed computer system in accordance with Claim 13 wherein said 
fixed-length subsequences are calculated by a method comprising the steps of: 

locating the origin of said subsequence in the 40th position of said target 
gene sequence in said preparation file; 

cutting a subsequence that is 96 base pairs long of said target gene 
sequence in said preparation file; and 

sorting said subsequences in said preparation file in lexicographical order 
beginning at a set position. 
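
The cutting and sorting recited in Claims 30 to 32 can be sketched as below. The sketch assumes that position 40 is counted from 1, that the "preselected number of positions" is a step between successive cuts, and that "beginning at a set position" means the sort key starts at a chosen offset inside each subsequence. All names and default values other than 40 and 96 are hypothetical.

```python
def cut_fragments(target, origin=40, length=96, step=1, sort_from=0):
    """Cut fixed-length subsequences from `target` and sort them.

    origin    -- 1-based position of the first cut (assumed convention)
    length    -- length of every subsequence in bases
    step      -- number of positions between successive cuts (assumed)
    sort_from -- offset inside each subsequence where the sort key begins
    """
    start = origin - 1  # convert to a 0-based index
    last = len(target) - length
    fragments = [target[i:i + length] for i in range(start, last + 1, step)]
    # Python compares strings lexicographically, character by character.
    return sorted(fragments, key=lambda frag: frag[sort_from:])
```

A small example with shortened parameters: `cut_fragments("GATTACAGATTACA", origin=3, length=4, step=3)` cuts TTAC, CAGA, and ATTA and returns them in the order ATTA, CAGA, TTAC.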

33. A programmed computer system in accordance with Claim 13 wherein said 
computer system includes means for prepending said preparation file subsequences with 
identifiers for the sources of each subsequence. 

34. A programmed computer system in accordance with Claim 1 wherein said 
presenting means to provide the results of said matching and modeling to display 
candidate oligonucleotide sequences includes means for displaying in multiple 
dimensions the gene sequences which result from the comparisons and calculations 
characterized in that said display format exhibits

the starting position of each candidate oligonucleotide sequence in one
dimension;

the specificity of a candidate oligonucleotide sequence's hybridization with 
the target gene sequence in a second dimension; and 

superimposed melting temperatures of gene sequences in contrasting 
presentations in at least an apparent third dimension. 



35. A programmed computer system in accordance with Claim 34 wherein said 
display further includes a cursor moveable along one dimension of said display that 
selects a position for an expansion of data representing the homology between the 
candidate oligonucleotide sequences and said gene sequence data; and 

wherein said display is operative to display in alphanumeric form the 
homology between the candidate oligonucleotide sequences and said gene sequence data. 

36. A programmed computer system in accordance with Claim 34 wherein said 
display is further operative to provide an expansion of data including presenting 

false hybridizations at various melting temperatures 
for all candidate oligonucleotide sequences; 

the location of each false hybridization; 

a candidate oligonucleotide sequence's starting
position; and

hairpin characteristics of each candidate
oligonucleotide sequence.

37. A programmed computer system in accordance with Claim 34 wherein said 
display format data is outputted to a printing means. 

38. A programmed computer system in accordance with Claim 34 wherein said 
display format data is saved to a data file. 

39. A programmed computer system in accordance with Claim 34 wherein said 
display format data is exported to another computer system. 
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40. A programmed computer system in accordance with Claim 34 wherein said 
display further includes a cursor moveable along one dimension of said display that 
selects a position for an expansion of data representing the homology between the 
candidate oligonucleotide sequences and said gene sequence data; and 

wherein said moveable cursor may be positioned by the user to select and 
save particular candidate oligonucleotide sequence information; and 

wherein said display is operative to display in alphanumeric form the 
homology between the candidate oligonucleotide sequences and said gene sequence 
data. 

41. A programmed computer system in accordance with Claim 40 wherein said 
method of selecting and saving particular candidate oligonucleotide sequence 
information comprises capturing candidate oligonucleotide sequence information at the 
user-selected point and storing said information in said memory means. 

42. A programmed computer system in accordance with Claim 41 wherein said 
user-selected candidate oligonucleotide sequence information is exported to another 
computer system. 

43. A programmed computer system in accordance with Claim 34 wherein said 
means for displaying comprises the steps of: 

calculating display output ranges; 

converting said output ranges to a logarithmic scale; 

interpolating said converted values; 

creating a bitmap of said interpolations; and 

displaying said bitmap on a display device. 
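
The display steps of Claim 43 can be pictured with a small Python sketch that turns a row of counts into a bar-chart bitmap. It is a loose illustration only: it assumes the values are non-negative counts, uses log10(v + 1) so that a count of zero stays at zero, and represents the bitmap as nested lists of 0 and 1 rather than a device bitmap. The function name is hypothetical.

```python
import math

def render_bitmap(values, height):
    """Return `height` rows of 0/1 pixels, one column (bar) per value."""
    # Convert to a logarithmic scale and find the output range.
    logs = [math.log10(v + 1) for v in values]
    low, high = min(logs), max(logs)
    span = (high - low) or 1.0  # avoid dividing by zero for a flat input
    # Linearly interpolate each log value onto 0..height pixel rows.
    bars = [round((x - low) / span * height) for x in logs]
    # Fill the bitmap; row 0 is the top of the image.
    return [[1 if bars[col] >= height - row else 0
             for col in range(len(values))]
            for row in range(height)]
```

For counts of 0, 9, and 99 and a height of 2, the logarithms are 0, 1, and 2, so the bars are 0, 1, and 2 pixels tall.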

44. A programmed computer system in accordance with Claim 34 wherein said 
means for displaying comprises the steps of: 

converting said result values to pixels; 
filling a pixel array with said pixels; 
performing a binary search into said pixel array; 

determining the number of pixels per candidate oligonucleotide sequence 
to be displayed;
interpolating said pixels at the value of pixels per position minus one;
computing an array of said pixel array; and 
plotting the results on a display device. 

45. A programmed computer system in accordance with Claim 1 wherein said 
means for performing exact and inexact match modeling utilizes said accessing means 
to introduce a user-selected set of gene sequence data and a user-selected set of target 
gene sequence data from said gene sequence data source into the computer system and 
said memory means to store said gene sequence data and said target gene sequence 
data and wherein said means for performing exact and inexact match modeling includes: 

means for determining a minimum sequence length; 

means for creating a look-up hash table and linked list in memory for 
each gene sequence in said gene sequence data and each of said target gene sequences; 

means for calculating the minimum length of any matching gene 
subsequence of said gene sequence data and said target gene sequence data; 

means for transforming base characters in each said target sequence and 
in each said gene sequence into numeric digits; 

means for comparing each base pair digit in each said target sequence 
stored in a hash table in memory to each base pair digit of said gene sequence stored 
in a hash table in memory; 

means for finding a matching seed by determining if the said comparison 
results in a matching gene subsequence of length equal to said calculated minimum 
length; 

means for comparing base pair digits behind and ahead of said seed to 
determine if there exists an extended match of a subsequence of base pair digits of 
length greater than the calculated minimum length, resulting in a current hit sequence; 

means for calculating whether said current hit sequence is longer than said 
minimum sequence length, resulting in a current candidate oligonucleotide sequence; 

means for storing said current candidate oligonucleotide sequence; and 

wherein said presenting means provides said current candidate 
oligonucleotide sequence to the user. 
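
The seed-and-extend matching recited in Claim 45 can be sketched in Python. A dictionary that maps each seed to a list of positions stands in for the look-up hash table and linked list, and bases are first transformed into the digits 0 to 3. The seed length and the minimum sequence length are supplied by the caller. All function names are hypothetical, and the sketch treats a hit of exactly the minimum length as acceptable, which is an assumption.

```python
from collections import defaultdict

DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq):
    """Transform base characters into numeric digits."""
    return [DIGIT[base] for base in seq]

def build_index(digits, seed_len):
    """Map every seed-length window to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(digits) - seed_len + 1):
        index[tuple(digits[i:i + seed_len])].append(i)
    return index

def find_hits(target, gene, seed_len, min_len):
    """Return sorted (target_start, gene_start, length) exact matches."""
    t, g = encode(target), encode(gene)
    index = build_index(g, seed_len)
    hits = set()
    for i in range(len(t) - seed_len + 1):
        for j in index.get(tuple(t[i:i + seed_len]), ()):
            a, b = i, j
            # Extend behind the seed while the digits keep matching.
            while a > 0 and b > 0 and t[a - 1] == g[b - 1]:
                a -= 1
                b -= 1
            e, f = i + seed_len, j + seed_len
            # Extend ahead of the seed while the digits keep matching.
            while e < len(t) and f < len(g) and t[e] == g[f]:
                e += 1
                f += 1
            if e - a >= min_len:  # assumed: "at least" the minimum length
                hits.add((a, b, e - a))
    return sorted(hits)
```

Several seeds inside the same matching stretch extend to the same hit, so the results are collected in a set to report each stretch once.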

46. A programmed computer system for designing candidate oligonucleotide 
sequences for use with a gene sequence data source including:

first input means for introducing user-selected gene sequence, design,
model and presentation criteria and a user-specified sequence length into the computer 
system; 

memory means for storing said gene sequence, design, model and 
presentation criteria and said sequence length; 

means for accessing gene sequence data from said gene sequence data
source;

wherein said accessing means is operative to introduce a user-selected set 
of gene sequence data and a user-selected set of target gene sequence data from said 
gene sequence data source into the computer system; 

wherein said criteria are used for comparison of gene sequence data and 
target gene sequence data; 

means for comparing said gene sequences against said target gene 
sequences employing said criteria; 

means for calculating candidate oligonucleotide sequences of said 
sequence length that are either common to a pool of user-specified gene sequences or 
specific to a particular user-specified gene sequence; 

means for calculating the homology between the candidate oligonucleotide 
sequences and said gene sequence data; 

means for calculating a candidate oligonucleotide sequence's hairpin
characteristics;

means for displaying in multiple dimensions the gene sequences which 
result from the comparisons and calculations characterized in that said display format 
exhibits: 

the starting position of each candidate oligonucleotide 
sequence in one dimension; 

a candidate oligonucleotide sequence's specificity to 
the target gene sequence in a second dimension; and 

superimposed melting temperatures of gene 
sequences in contrasting presentations in at least an 
apparent third dimension; 

wherein said display further includes a cursor moveable along one 
dimension of said display that selects a position for an expansion of data representing
the homology between the candidate oligonucleotide sequences and said gene sequence
data; 

wherein said display is operative to display in alphanumeric form the 
homology between the candidate oligonucleotide sequences and said gene sequence 
data; and 

wherein said display is operative to provide an expansion of data including
presenting

false hybridizations at various melting temperatures 
for all candidate oligonucleotide sequences; 

the location of each false hybridization; 

a candidate oligonucleotide sequence's starting
position; and

hairpin characteristics of each candidate
oligonucleotide sequence.

47. A method for designing candidate oligonucleotide sequences by 
performing exact and inexact match modeling for use with a gene sequence data source 
comprising the steps of: 

introducing user-selected gene sequence into a computer system; 

accessing gene sequence data from said gene sequence data source; 

storing user-selected gene sequence in the memory of the computer
system;

accessing the gene sequence source to introduce the user-selected set of 
gene sequence data and a user-selected set of target gene sequence data from said gene 
sequence data source into the computer system; 

storing said gene sequence data and said target gene sequence data in the 
memory of the computer system; 

determining a minimum sequence length; 

creating a look-up hash table and linked list in memory for each gene 
sequence in said gene sequence data and each of said target gene sequences; 

calculating the minimum length of any matching gene subsequence of said 
gene sequence data and said target gene sequence data; 




comparing each base pair character in each said target sequence stored in 
a hash table in memory to each base pair character of said gene sequence stored in a 
hash table in memory; 

determining a matching seed by determining if the said comparison results 
in a matching gene subsequence of length equal to said calculated minimum length; 

comparing base pair characters behind and ahead of said seed to 
determine if there exists an extended match of a subsequence of base pair characters of 
length greater than the calculated minimum length, resulting in a current hit sequence; 

calculating whether said current hit sequence is longer than said minimum 
sequence length, resulting in a current candidate oligonucleotide sequence; 

storing said current candidate oligonucleotide sequence in the memory of 
the computer system; and 

presenting a representation of said current candidate oligonucleotide 
sequence to the user. 

48. A method in accordance with Claim 47 wherein said method includes the 
steps for performing additional calculations for each candidate oligonucleotide probe, 
said additional calculations comprising: 

calculating the melting temperature for each candidate oligonucleotide
sequence;

tracking the number and melting temperature of the matches for each 
candidate oligonucleotide sequence; 

tracking the location of a set number of the best candidate oligonucleotide 
sequences; and 

presenting said additional results to the user. 

49. A method in accordance with Claim 47 wherein said method includes the 
step of transforming base characters into numeric digits. 

50. A method in accordance with Claim 47 wherein said method includes the 
step of determining the length of sequences from said target gene sequence data. 

51. A method in accordance with Claim 47 wherein said method includes the 
step of determining the length of sequences from said set of gene sequence data.

52. A method in accordance with Claim 47 wherein said method includes the
steps of: 

copying the LOCUS name for each said gene sequence into the memory 
of the computer system; and 

linking said LOCUS name with each said gene sequence. 

53. A method in accordance with Claim 47 wherein said method includes the 
steps of: 

introducing a user-selected minimum sequence length into the computer
system; and

storing said minimum sequence length in the memory of the computer
system.

54. A method in accordance with Claim 47 wherein said method includes the 
steps for performing additional calculations for each candidate oligonucleotide probe, 
said additional calculations comprising: 

calculating the melting temperature for each candidate oligonucleotide
sequence;

tracking the number and melting temperature of the matches for each
candidate oligonucleotide sequence;

tracking the location of a set number of the best candidate oligonucleotide 
sequences employing a priority queue by sorting said candidate oligonucleotide 
sequences in reverse order and sorting said candidate oligonucleotide sequences by 
hybridization strength; and 

presenting said additional results to the user. 

55. A method in accordance with Claim 47 wherein said step for calculating 
the minimum length of any matching gene subsequence comprises: 

introducing a user-selected maximum number of mismatches and a user- 
selected minimum candidate oligonucleotide sequence length into the computer system; 

subtracting said maximum number of mismatches from said minimum 
candidate oligonucleotide sequence length to give a first result;

dividing said first result by said maximum number of mismatches plus one
to give a second result; 

incrementing said second result by one if the remainder is not equal to 
zero to give a third result; and 

truncating said third result to an integer. 
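
The arithmetic of Claim 55 amounts to a rounded-up division, sketched below in Python. The reasoning is a pigeonhole argument: m mismatches split the remaining L - m matching bases of a sequence of length L into at most m + 1 uninterrupted runs, so at least one run is no shorter than the rounded-up quotient. The function and parameter names are hypothetical.

```python
def min_seed_length(min_probe_len, max_mismatches):
    """Shortest exact run guaranteed inside any acceptable inexact match."""
    first = min_probe_len - max_mismatches           # matching bases left
    second, remainder = divmod(first, max_mismatches + 1)
    third = second + 1 if remainder else second      # round up if inexact
    return int(third)                                # already an integer
```

For a minimum sequence length of 21 and at most 2 mismatches, the first result is 19, the division by 3 leaves a remainder, and the minimum matching length is 7.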

56. A method in accordance with Claim 47 wherein said method includes the 
step of calculating the hairpin characteristics of said candidate oligonucleotide sequence. 

57. A method in accordance with Claim 47 wherein said method includes the 
step of calculating the hairpin characteristics of said candidate oligonucleotide sequence 
comprising: 

calculating a complementary sequence to the candidate oligonucleotide 
sequence by reversing the base pair order of the candidate oligonucleotide sequence and 
substituting complementary base pairs; 

comparing each character of said original candidate oligonucleotide 
sequence and said complementary sequence; 

finding the longest match between said original candidate oligonucleotide 
sequence and said complementary sequence; and 

saving the match with the longest hairpin distance if any two matches have 
the same length. 

58. A method for designing candidate oligonucleotide sequences by 
performing hybridization strength modeling for use with a gene sequence data source 
comprising the steps of: 

introducing user-selected gene sequence and a user-selected screening 
threshold into a computer system; 

storing user-selected gene sequence and said screening threshold in the 
memory of the computer system; 

accessing the gene sequence source to introduce the user-selected set of 
gene sequence data and a user-selected set of target gene sequence data from said gene 
sequence data source into the computer system; 

storing said gene sequence data and said target gene sequence data in the 
memory of the computer system;

preprocessing said target gene sequence data and said gene sequence data
by selecting only those sequences without introns; 

forming a preparation file of gene sequence fragments by cutting said 
target gene sequences into fixed length target gene subsequences and sorting said 
subsequences in lexicographical order; 

merge sorting said gene sequences; 

forming multiple lists of screens by forming lists of subsequences of the 
preparation file of length equal to said screening threshold; 

indexing and sorting said screens in memory; 

storing said screens in the memory of the computer system; 

sequentially comparing said preparation file gene sequences with each of 
said screens to design candidate oligonucleotide sequences; 

calculating the hybridization strengths between a gene sequence and all 
candidate oligonucleotide sequences containing that gene sequence by accounting for 
Guanine-Cytosine (GC) and Adenine-Thymine (AT) base pair content of the gene 
sequence and the number of mismatches between said preparation file sequences and 
a said screen when said comparison results in a match; 

preparing the candidate oligonucleotide sequence and hybridization 
strength for presentation to the user; and 

presenting the candidate oligonucleotide sequence and hybridization 
strength to the user. 

59. A method in accordance with Claim 58 wherein said method includes the 
steps for performing additional calculations for each candidate oligonucleotide probe, 
said additional calculations comprising: 

calculating the melting temperature for each candidate oligonucleotide
sequence;

tracking the number and melting temperature of the matches for each 
candidate oligonucleotide sequence; 

tracking the location of a set number of the best candidate oligonucleotide 
sequences; and 

presenting said additional results to the user.

60. A method in accordance with Claim 58 wherein the step for preparing the
candidate oligonucleotide sequence for presenting to the user comprises: 

assigning a numerical score to each said gene sequence; 

sorting said gene sequences in accordance with said numerical score; and 
displaying a representation of the resulting candidate oligonucleotide 
sequence and said gene sequences. 

61. A method in accordance with Claim 58 wherein said method includes the 
steps of: 

copying the LOCUS name for each said gene sequence into the memory 
of the computer system; and 

prepending said gene sequence with said LOCUS name. 

62. A method in accordance with Claim 58 wherein the step for forming lists 
of screens produces four lists of screens. 

63. A method in accordance with Claim 58 wherein said method includes
the step of shifting each screen by one base pair as it is formed.

64. A method in accordance with Claim 58 wherein said method includes the 
steps for performing additional calculations for each candidate oligonucleotide probe, 
said additional calculations comprising: 

calculating the melting temperature for each candidate oligonucleotide
sequence;

tracking the number and melting temperature of the matches for each 
candidate oligonucleotide sequence; 

tracking the location of a set number of the best candidate oligonucleotide 
sequences employing a priority queue by sorting said candidate oligonucleotide 
sequences in reverse order and sorting said candidate oligonucleotide sequences by 
hybridization strength; and 

presenting said additional results to the user. 

65. A method in accordance with Claim 58 wherein said method for preparing 
the results for presenting to the user comprises:

assigning a numerical score to each said gene sequence by tallying the
quantity "exp" where "exp" = Σ e^(-Tm) and wherein Tm is the melting temperature for the
said gene sequence; 

sorting said gene sequences in order of the numerical score; and 
displaying a representation of the resulting candidate oligonucleotide 
sequence and said gene sequences. 

66. A method in accordance with Claim 58 for use with a gene sequence data 
source, programmed to determine hybridization strength comprising the steps of: 

comparing base pairs of a first gene sequence and a second gene sequence 
to determine if a match exists; 

incrementing said first gene sequence's bound strength by some first 
number if a base pair character in said first gene sequence and said second gene 
sequence match and the matched base pair is equal to a combination of the bases 
Guanine (G) and Cytosine (C); 

incrementing said first gene sequence's bound strength by some second 
number if a base pair character in said first gene sequence and said second gene 
sequence match and the matched base pair is equal to a combination of the bases 
Adenine (A) and Thymine (T); 

decrementing said first gene sequence's bound strength by a third number 
if there is no match in base pairs between said first gene sequence and said second gene 
sequence; 

comparing said first gene sequence's bound strength to said first gene 
sequence's unbound strength; 

setting said first gene sequence's unbound strength equal to its bound 
strength if said first gene sequence's bound strength is greater than said first gene 
sequence's unbound strength; and 

resetting said first gene sequence's bound strength to zero if said first gene 
sequence's unbound strength is less than zero. 

67. A method in accordance with Claim 66 wherein said first and second 
numbers are greater than zero.

68. A method in accordance with Claim 66 wherein said second number is
in the order of 42% of said first number. 

69. A method in accordance with Claim 66 wherein said third number is
in the order of 5% larger than said first number.

70. A method in accordance with Claim 58 wherein said method includes the 
step of calculating the hairpin characteristics of said candidate oligonucleotide sequence. 

71. A method in accordance with Claim 70 wherein the step of calculating the 
hairpin characteristics of said candidate oligonucleotide sequence includes the steps of: 

calculating a complementary sequence to the candidate oligonucleotide 
sequence by reversing the base pair order of the candidate oligonucleotide sequence and 
substituting complementary base pairs; 

comparing each character of said original candidate oligonucleotide 
sequence and said complementary sequence; 

finding the longest match between said original candidate oligonucleotide 
sequence and said complementary sequence; and 

saving the match with the longest hairpin distance if any two matches have 
the same length. 

72. A method in accordance with Claim 58 wherein said fixed-length target 
gene subsequences are calculated by a method comprising the steps of: 

locating the origin of said subsequence in a set position of said target gene 
sequence in said preparation file; 

cutting a subsequence that is a fixed-length long every preselected number 
of positions of said target gene sequence in said preparation file; and 

sorting said subsequences in said preparation file in lexicographical order 
beginning at a set position. 

73. A method in accordance with Claim 72 wherein the origin of said 
subsequence is located at position 40 of said target sequence in said preparation file.

74. A method in accordance with Claim 58 wherein said fixed-length
subsequences are calculated by a method comprising the steps of: 

locating the origin of said subsequence in the 40th position of said target 
gene sequence in said preparation file; 

cutting a subsequence that is 96 base pairs long of said target gene 
sequence in said preparation file; and 

sorting said subsequences in said preparation file in lexicographical order 
beginning at a set position. 

75. A method in accordance with Claim 58 wherein said method includes the 
step of prepending said preparation file subsequences with identifiers for the sources of 
each subsequence. 

76. A method in accordance with Claim 58 wherein said method includes the
step of calculating a candidate oligonucleotide sequence's melting temperature
comprising: 

solving the formula Tm = 81.5 - 16.6(log[Na]) - 0.63 %(formamide) + ((0.41
(%(G + C)) - 600)/N);

wherein log[Na] is the sodium concentration, %(G + C) is the fraction of
matched base pairs which are G-C complementary, N is the sequence length; and

wherein the number of mismatches is equal to zero. 

77. A method in accordance with Claim 58 wherein said method includes the 
step for reducing a candidate oligonucleotide sequence's calculated melting temperature 
by a preselected amount for each percent of mismatch between the candidate 
oligonucleotide sequence and a user-selected target gene sequence based upon the 
assumption that there are an equal number of GC and AT base pair mismatches. 

78. A method in accordance with Claim 58 wherein said method includes the 
step for reducing a candidate oligonucleotide sequence's calculated melting temperature 
by a preselected amount comprising the steps of: 

reducing said calculated melting temperature by 2 degrees Celsius if an 
AT mismatch exists; and

reducing said calculated melting temperature by 4 degrees Celsius if a GC
mismatch exists. 

79. A method for designing candidate oligonucleotide sequences for use with 
a gene sequence data source comprising the steps of: 

introducing user-selected gene sequence and a user-specified sequence 
length into a computer system; 

storing said gene sequence and said sequence length in the memory of the 
computer system; 

accessing gene sequence data from said gene sequence data source; 

accessing the gene sequence source to introduce the user-selected set of 
gene sequence data and a user-selected set of target gene sequence data from said gene 
sequence data source into the computer system; 

comparing said gene sequences against said target gene sequences 
employing said criteria; 

calculating candidate oligonucleotide sequences of said sequence length 
that are either common to a pool of user-specified gene sequences or specific to a 
particular user-specified gene sequence; 

calculating the homology between the candidate oligonucleotide sequences 
and said gene sequence data; 

displaying in multiple dimensions the gene sequences which result from 
the comparisons and calculations characterized in that said display format exhibits: 

the starting position of each candidate oligonucleotide sequence in one
dimension;

a candidate oligonucleotide sequence's specificity to the target gene 
sequence in a second dimension; and 

superimposed melting temperatures of gene sequences in contrasting 
presentations in at least an apparent third dimension. 

80. A method in accordance with Claim 79 wherein said method includes the 
step of calculating a candidate oligonucleotide sequence's hairpin characteristics. 

81. A method in accordance with Claim 80 wherein said step of calculating 
hairpin characteristics for a gene sequence comprises: 

calculating a complementary sequence to said gene sequence by 
reversing the base pair order of the gene sequence and substituting complementary base 
pairs; 

comparing each character of said original gene sequence and said 
complementary sequence; 

finding the longest match between said original gene sequence and said 
complementary sequence; and 

saving the match with the longest hairpin distance if any two matches have 
the same length. 
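
The hairpin steps recited above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: "hairpin distance" is taken to mean the number of bases lying between the two matching arms, the arms are required not to overlap, and the names `reverse_complement` and `longest_hairpin` are hypothetical.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(seq):
    """Reverse the base order and substitute complementary bases."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.lower()))


def longest_hairpin(seq):
    """Return (match_length, distance) for the longest self-complementary match.

    Assumption: `distance` counts the bases between the two arms; among
    equally long matches the one with the larger distance is kept.
    """
    seq = seq.lower()
    n = len(seq)
    rc = reverse_complement(seq)
    best_len, best_dist = 0, 0
    prev = [0] * (n + 1)  # match lengths for the previous row of comparisons
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            if seq[i - 1] != rc[j - 1]:
                continue
            cur[j] = prev[j - 1] + 1  # extend the match along the diagonal
            dist = n - i - j  # bases between the arm ending at i and its partner
            if dist < 0:
                continue  # arms would overlap; the mirrored hit covers this case
            if (cur[j], dist) > (best_len, best_dist):
                best_len, best_dist = cur[j], dist
        prev = cur
    return best_len, best_dist
```

For example, `longest_hairpin("ggggaaaacccc")` reports a four-base match whose arms are separated by four bases.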

82. A method in accordance with Claim 79 wherein the step of displaying 
further includes producing a cursor moveable along one dimension of said display that 
selects a position for an expansion of data representing the homology between the 
candidate oligonucleotide sequences and said gene sequence data; and 

displaying in alphanumeric form the homology between the candidate 
oligonucleotide sequences and said gene sequence data. 

83. A method in accordance with Claim 79 wherein said display format data 
is outputted to a printing means. 

84. A method in accordance with Claim 79 wherein said display format data 
is saved to a data file. 

85. A method in accordance with Claim 79 wherein said display format data 
is exported to another computer system. 

86. A method in accordance with Claim 79 wherein the step of displaying 
further includes producing a cursor moveable along one dimension of said display that 
selects a position for an expansion of data representing the homology between the 
candidate oligonucleotide sequences and said gene sequence data; 

positioning said moveable cursor to select and save particular candidate 
oligonucleotide sequence information; and 

displaying in alphanumeric form the homology between the candidate 
oligonucleotide sequences and said gene sequence data. 

87. A method in accordance with Claim 79 wherein the step of displaying 
further includes producing a cursor moveable along one dimension of said display that 
selects a position for an expansion of data representing the homology between the 
candidate oligonucleotide sequences and said gene sequence data; 

positioning said moveable cursor to select and save particular candidate 
oligonucleotide sequence information; 

capturing candidate oligonucleotide sequence information at the 
user-selected point and storing said information in said memory means; and 

displaying in alphanumeric form the homology between the candidate 
oligonucleotide sequences and said gene sequence data. 

88. A method in accordance with Claim 79 wherein said method of displaying 
comprises: 

calculating display output ranges; 

converting said output ranges to a logarithmic scale; 

interpolating said converted values; 

creating a bitmap of said interpolations; and 

displaying said bitmap on a display device. 
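
The scale conversion and interpolation steps of Claim 88 might look like the sketch below. It covers only those two steps (no bitmap is created), assumes strictly positive input values, base-10 logarithms, and linear interpolation, and uses the hypothetical name `log_interpolate`; none of these choices is stated in the claim.

```python
import math


def log_interpolate(values, width):
    """Convert values to a log10 scale and interpolate them across `width` columns."""
    logs = [math.log10(v) for v in values]  # assumes every value is > 0
    if width == 1 or len(logs) == 1:
        return [logs[0]] * width
    columns = []
    for x in range(width):
        pos = x * (len(logs) - 1) / (width - 1)  # fractional index into logs
        lo = int(pos)
        hi = min(lo + 1, len(logs) - 1)
        columns.append(logs[lo] + (logs[hi] - logs[lo]) * (pos - lo))
    return columns
```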



89. A method in accordance with Claim 79 wherein said method of displaying 
comprises: 

converting said result values to pixels; 

filling a pixel array with said pixels; 

performing a binary search into said pixel array; 

determining the number of pixels per candidate oligonucleotide sequence 
to be displayed; 

interpolating said pixels at the value of pixels per position minus one; 

computing an array of said pixel array; and 

plotting the results on a display device. 

90. A method to determine hybridization strength between two or more gene 
sequences for use with a gene sequence data source, comprising the steps of: 

accessing gene sequence data from said gene sequence data source; 

comparing base pairs of a first gene sequence and a second gene sequence 
to determine if a match exists; 

incrementing said first gene sequence's bound strength by some first 
number if a base pair character in said first gene sequence and said second gene 
sequence match and the matched base pair is equal to a combination of the bases 
Guanine (G) and Cytosine (C); 

incrementing said first gene sequence's bound strength by some second 
number if a base pair character in said first gene sequence and said second gene 
sequence match and the matched base pair is equal to a combination of the bases 
Adenine (A) and Thymine (T); 

decrementing said first gene sequence's bound strength by a third number 
if there is no match in base pairs between said first gene sequence and said second gene 
sequence; 

comparing said first gene sequence's bound strength to said first gene 
sequence's unbound strength; 

setting said first gene sequence's unbound strength equal to its bound 
strength if said first gene sequence's bound strength is greater than said first gene 
sequence's unbound strength; and 

resetting said first gene sequence's bound strength to zero if said first gene 
sequence's unbound strength is less than zero. 
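
A running-score scheme in the spirit of Claim 90 can be sketched as follows. The weights are illustrative values chosen to agree with the proportions given in Claims 92 and 93 (second number about 42% of the first, third number about 5% larger than the first). The sketch assumes both strengths start at zero and that the reset is triggered when the running bound strength itself falls below zero; that reading is an assumption, and all names are hypothetical.

```python
def hybridization_strength(first_seq, second_seq,
                           gc_gain=1.0, at_gain=0.42, mismatch_penalty=1.05):
    """Return the best running strength found while comparing two sequences."""
    bound = 0.0    # running strength of the current matching stretch
    unbound = 0.0  # best strength seen so far
    for a, b in zip(first_seq.lower(), second_seq.lower()):
        if a == b:
            bound += gc_gain if a in "gc" else at_gain
        else:
            bound -= mismatch_penalty
        if bound > unbound:
            unbound = bound
        if bound < 0:  # assumed reset condition, see the note above
            bound = 0.0
    return unbound
```

With these illustrative weights, four matching G/C positions score 4.0, while a single mismatch inside a G/C run lowers the running total without discarding the matches that came before it.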

91. A method in accordance with Claim 90 wherein said first and second 
numbers are greater than zero. 

92. A method in accordance with Claim 90 wherein said second number is in 
the order of 42% of said first number. 

93. A method in accordance with Claim 90 wherein said third number is in 
the order of 5% larger than said first number. 

94. A method of calculating the minimum length of any matching gene 
subsequence comprising: 

introducing a user-selected maximum number of mismatches and a 
user-selected minimum candidate oligonucleotide sequence length; 

subtracting said maximum number of mismatches from said minimum 
candidate oligonucleotide sequence length to give a first result; 

dividing said first result by said maximum number of mismatches plus one 
to give a second result; 

incrementing said second result by one if the remainder is not equal to 
zero to give a third result; and 

truncating said third result to an integer. 
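
The arithmetic of Claim 94 amounts to rounding (length minus mismatches) divided by (mismatches plus one) up to the next integer. A minimal sketch, with the hypothetical function name `min_match_length` and integer inputs assumed:

```python
def min_match_length(min_probe_length, max_mismatches):
    """Minimum length of a matching subsequence, following the steps of Claim 94."""
    first = min_probe_length - max_mismatches
    quotient, remainder = divmod(first, max_mismatches + 1)
    # Add one when the division leaves a remainder; the result is an integer.
    return quotient + 1 if remainder else quotient
```

For instance, a 20-base candidate with at most 2 mismatches gives 18 / 3 = 6, and with at most 3 mismatches gives 17 / 4, which rounds up to 5.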

95. A method of calculating hairpin characteristics for a gene sequence 
comprising: 

calculating a complementary sequence to said gene sequence by 
reversing the base pair order of the gene sequence and substituting complementary base 
pairs; 

comparing each character of said original gene sequence and said 
complementary sequence; 

finding the longest match between said original gene sequence and said 
complementary sequence; and 

saving the match with the longest hairpin distance if any two matches have 
the same length. 

96. A method of creating a preparation file from a user-selected set of target 
gene sequence data comprising: 

cutting said target gene sequence data into fixed-length subsequences; and 
storing said subsequences in a preparation file. 

97. A method of creating a preparation file from a user-selected set of target 
gene sequence data comprising: 

cutting said target gene sequence data into fixed-length subsequences in 
the order of 96 base pairs in length; and 

storing said subsequences in a preparation file. 

98. A method in accordance with Claim 97 wherein said fixed-length 
subsequences are calculated by a method comprising the steps of: 

locating the origin of said subsequence in a set position of said target gene 
sequence in said preparation file; 

cutting a fixed-length subsequence every preselected number 
of positions of said target gene sequence in said preparation file; and 

sorting said subsequences in said preparation file in lexicographical order 
beginning at a set position. 



99. A method in accordance with Claim 97 wherein said fixed-length 
subsequences are calculated by a method comprising the steps of: 

locating the origin of said subsequence in a set position of said target gene 
sequence in said preparation file wherein the origin of said subsequence is located at 
position 40 of said target sequence in said preparation file; 

cutting a fixed-length subsequence every preselected number 
of positions of said target gene sequence in said preparation file; and 

sorting said subsequences in said preparation file in lexicographical order 
beginning at a set position. 

100. A method in accordance with Claim 97 wherein said fixed-length 
subsequences are calculated by a method comprising the steps of: 

locating the origin of said subsequence at the 40th position of said target 
gene sequence in said preparation file; 

cutting a subsequence that is 96 base pairs long of said target gene 
sequence in said preparation file; and 

sorting said subsequences in said preparation file in lexicographical order 
beginning at a set position. 
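
One possible reading of the cutting and sorting steps in Claims 96 to 100 is sketched below. The interpretation is an assumption: each record is a fixed-length window of the target, a window is cut every `step` positions, and records are sorted lexicographically by comparing them from offset `origin` within each record. The function name and all parameter names are hypothetical; the defaults of 96 and 40 echo the figures recited in the claims.

```python
def cut_subsequences(target, length=96, step=1, origin=40):
    """Cut fixed-length windows from `target` and sort them from offset `origin`."""
    records = [target[i:i + length]
               for i in range(0, len(target) - length + 1, step)]
    # Lexicographical order, with the comparison starting at the set position.
    records.sort(key=lambda record: record[origin:])
    return records
```

On a toy input, `cut_subsequences("gattaca", length=4, step=1, origin=2)` cuts the windows `gatt`, `atta`, `ttac`, `taca` and orders them by their last two characters.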

101. A method of forming lists of screens of target gene sequence data 
comprising: 

introducing a user-selected screening threshold; and 
forming subsequences of said target gene sequence data of length equal 
to a user-selected screening threshold. 

102. A method of preprocessing a user-selected set of target gene sequence 
data comprising the steps of: 

searching for sequences without introns in said target gene sequences; 
extracting target gene sequences that do not contain introns; and 
storing said extracted target gene sequences. 

AMENDED CLAIMS 

[received by the International Bureau on 4 April 1994 (04.04.94); 
original claim 69 amended; remaining claims unchanged (1 page)] 

68. A method in accordance with Claim 66 wherein said second number is 
in the order of 42% of said first number. 

69. A method in accordance with Claim 66 wherein said third number is in 
the order of 5% larger than said first number. 

70. A method in accordance with Claim 58 wherein said method includes the 
step of calculating the hairpin characteristics of said candidate oligonucleotide sequence. 

71. A method in accordance with Claim 70 wherein the step of calculating the 
hairpin characteristics of said candidate oligonucleotide sequence includes the steps of: 

calculating a complementary sequence to the candidate oligonucleotide 
sequence by reversing the base pair order of the candidate oligonucleotide sequence and 
substituting complementary base pairs; 

comparing each character of said original candidate oligonucleotide 
sequence and said complementary sequence; 

finding the longest match between said original candidate oligonucleotide 
sequence and said complementary sequence; and 

saving the match with the longest hairpin distance if any two matches have 
the same length. 

72. A method in accordance with Claim 58 wherein said fixed-length target 
gene subsequences are calculated by a method comprising the steps of: 

locating the origin of said subsequence in a set position of said target gene 
sequence in said preparation file; 

cutting a fixed-length subsequence every preselected number 
of positions of said target gene sequence in said preparation file; and 

sorting said subsequences in said preparation file in lexicographical order 
beginning at a set position. 

73. A method in accordance with Claim 72 wherein the origin of said 
subsequence is located at position 40 of said target sequence in said preparation file. 



AMENDED SHEET (ARTICLE 19) 

FIG. 6A (1) 

PROBE: C:\HITACHI\JUNMIX.PRP 
HYBRIDIZATION: C:\HITACHI\HUMBJUNX.CDS 
Length = 374 Hairpin = 35 

Locus      Pos   Tm 
humbjunx   374   61.47 
musbjunx   365   61.47 
humdjunx   41    34.82   t g-g--agt 
humbjunx   182   31.12   a gtgg--gc 
humdjunx   602   31.12   c c-ggg-gc 
humdjunx   602   31.12   c c-ggg-gc 

PROBE: C:\HITACHI\JUNMIX.PRP 
HYBRIDIZATION: C:\HITACHI\HUMBJUNX.CDS 
Length = 377 Hairpin = 2 14 

Locus      Pos   Tm 
humbjunx   377   61.55 
musbjunx   368   61.55 
humdjunx   383   28.12   tg-cg-c--g 
musdjunx   383   28.12   tg-ca-c--g 
musdjunx   383   28.12   tg-ca-c--g 

PROBE: C:\HITACHI\JUNMIX.PRP 
HYBRIDIZATION: C:\HITACHI\HUMBJUNX.CDS 
Length = 389 Hairpin = 33 

Locus      Pos   Tm 
humbjunx   389   61.7 
muscjunx   314   56.65   -c 
musbjunx   380   50.85   t--g 
humcjunx   314   49.35   -t g 
humdjunx   395   33.85   tt-gc--ag 
musdjunx   395   33.85   tt-gc--aa 
humcjunx   326   32.35   g-ttcgcc tg 
humdjunx   404   32.35   --ttcgcc t- 
muscjunx   326   32.35   gcttcgcc tg 
musdjunx   253   30.85   gacg-gct-ct 
humbjunx   953   30.65   g 1 --c-cagct- 
musdjunx   83    27.3    cc-gcggt-gt g 

FIG. 6A (2) 

PROBE: C:\HITACHI\JUNMIX.PRP 
HYBRIDIZATION: C:\HITACHI\HUMBJUNX.CDS 
Length = 417 Hairpin = 2 15 

Locus      Pos   Tm 
humbjunx   417   60.08 
musbjunx   408   55.52   c 
humdjunx   420   37.3    c g g t-a- 
musbjunx   61    29.0    g gg ca-cctgt- 
muscjunx   672   26.27   gc-gc a-g--aga-- 

FIG. 6A (3) 

PROBE: C:\HITACHI\JUNMIX.PRP 
HYBRIDIZATION: C:\HITACHI\HUMBJUNX.CDS 
Length = 461 Hairpin = 49 

Locus      Pos   Tm 
humbjunx   461   61.63 
musbjunx   452   61.63 
musbjunx   452   61.63 

PROBE: C:\HITACHI\JUNMIX.PRP 
HYBRIDIZATION: C:\HITACHI\HUMBJUNX.CDS 
Length = 467 Hairpin = 2 13 

Locus      Pos   Tm 
humbjunx   467   61.7 
musbjunx   458   51.6    c-g- 
humdjunx   32    29.35   tgagcgg gcgg- 
humdjunx   32    29.35   tgagcgg gcgg- 

PROBE: C:\HITACHI\JUNMIX.PRP 
HYBRIDIZATION: C:\HITACHI\HUMBJUNX.CDS 
Length = 477 Hairpin = 24 

Locus      Pos   Tm 
humbjunx   477   61.37 
humdjunx   489   34.93   c-c cg 
humdjunx   489   34.93   c-c cg 

PROBE: C:\HITACHI\JUNMIX.PRP 
HYBRIDIZATION: C:\HITACHI\HUMBJUNX.CDS 
Length = 487 Hairpin = 33 

Locus      Pos   Tm 
humbjunx   487   61.14 
musdjunx   74    51.0    ct 
humdjunx   499   45.64   t g 
humdjunx   527   30.72   cc-c-c 
musdjunx   97    30.72   ttc-c g 
musdjunx   580   30.72   -cc t-g 
musdjunx   637   30.72   cc-cc g 
musdjunx   637   30.72   cc-cc g 

FIG. 6A (4) 

PROBE: C:\HITACHI\JUNMIX.PRP 
HYBRIDIZATION: C:\HITACHI\HUMBJUNX.CDS 
Length = 498 Hairpin = 32 

Locus      Pos   Tm 
humbjunx   498   61.26 
humbjunx   498   61.26 

PROBE: C:\HITACHI\JUNMIX.PRP 
HYBRIDIZATION: C:\HITACHI\HUMBJUNX.CDS 
Length = 504 Hairpin = 32 

Locus      Pos   Tm 
humbjunx   504   61.47 
musbjunx   495   40.35   c--a- 1- 
humdjunx   609   35.29   cg cgggg- 
humdjunx   609   35.29   cg cgggg- 

FIG. 22 

[Flowchart, Stage I (fds.cpp). Box labels recoverable from the drawing, in order: 
Begin Stage I (fds.cpp); Open user-selected PRP file; Read PRP sequence into memory; 
[...] target sequence file; Read each target sequence into memory; [...] screens in memory; 
Move pointer one base [...] on the target sequence; Increment boundStrength.] 